RAG is architecturally simple: chunk documents, embed them, store in a vector DB, retrieve the top-k on query, pass retrieved context to an LLM, return answer. The demo takes an afternoon. The production system takes months, because “works on the demo documents” is nowhere near “answers correctly 95% of the time across the full document corpus.”

This post is about the gap between those two states.

The Architecture You Start With

User query
    │
    ▼
Embedding model  ──────────────▶  Vector DB (pgvector / Qdrant / Weaviate)
    │                                     │
    │                              top-k similar chunks
    │                                     │
    ▼                                     ▼
                    LLM prompt:
                    ┌─────────────────────────────────┐
                    │ Context: {retrieved chunks}      │
                    │                                  │
                    │ Question: {user query}           │
                    │                                  │
                    │ Answer based on context only:    │
                    └─────────────────────────────────┘
                              │
                              ▼
                           Answer

This works. For a narrow, well-curated corpus with clear questions, it works quite well. Here’s where it breaks.

Failure Mode 1: Chunking Strategy

Chunking is the most impactful decision in the entire pipeline, and the one least covered in tutorials.

Fixed-size chunking (split every N tokens) is the default everywhere. It’s also wrong for most document types, because it ignores document structure. A 512-token chunk that cuts through a table, splits a code example in half, or separates a question from its answer is worse than useless — it provides partial context that misleads the LLM.

Structure-aware chunking respects the document’s natural boundaries:

Document type        Better chunking strategy
Markdown docs        Split by #/## headings
API documentation    Split by endpoint/method
Code files           Split by function/class
PDFs (mixed)         Split by paragraph, preserve tables as single chunks
Q&A documents        Keep question + answer together always
Long narratives      Hierarchical: sections as parent, paragraphs as children

We moved from fixed 512-token chunks to heading-based chunking for our internal documentation corpus and saw answer quality (as measured by our evaluation set) improve from 61% to 78% without any other changes.

Chunk overlap helps with context continuity but increases index size. We settled on 15% overlap after testing 0%, 10%, 20%, 30% — beyond 20% the retrieval quality didn’t improve but the index grew by 25%.
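As a concrete sketch, heading-based chunking for markdown is only a few lines of Go. This is a minimal version for illustration; a production chunker would also need to skip headings inside fenced code blocks and enforce a maximum chunk size:

```go
package main

import (
	"fmt"
	"strings"
)

// SplitByHeadings chunks a markdown document at # / ## boundaries,
// keeping each heading together with the body text beneath it.
func SplitByHeadings(doc string) []string {
	var chunks []string
	var current []string
	for _, line := range strings.Split(doc, "\n") {
		if strings.HasPrefix(line, "# ") || strings.HasPrefix(line, "## ") {
			if len(current) > 0 {
				chunks = append(chunks, strings.TrimSpace(strings.Join(current, "\n")))
			}
			current = current[:0]
		}
		current = append(current, line)
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.TrimSpace(strings.Join(current, "\n")))
	}
	return chunks
}

func main() {
	doc := "# Intro\nsome text\n## Setup\nmore text\n## Usage\nfinal text"
	for i, c := range SplitByHeadings(doc) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```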

Failure Mode 2: Retrieval Quality

Cosine similarity on embeddings is not a perfect relevance signal. Top-k retrieval by embedding similarity returns “semantically similar” chunks, which is not the same as “the most useful context for this question.”

Problems:

  • Semantic similarity ≠ answer relevance: a question about “how to cancel a trade” may retrieve chunks about “trade confirmation” (high semantic similarity) rather than “trade cancellation API” (exact answer)
  • Short queries lose to long documents: embedding models compress longer text into the same fixed-dimension space as short text; nuance is lost asymmetrically
  • Recent documents may rank lower than older ones: the embedding captures meaning, not recency

Improvements that worked:

Hybrid search — combine dense vector search with sparse BM25 keyword search, then merge with Reciprocal Rank Fusion (RRF). Dense search handles semantic similarity; sparse search handles exact keyword matching. For technical documentation where users query by exact API names, the sparse component was critical.

Vector similarity score (dense)  ──┐
                                    ├── RRF merge → final ranking
BM25 keyword score (sparse)      ──┘
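The RRF merge itself is small. A minimal sketch, assuming each retriever returns a ranked list of document IDs; k = 60 is the conventional RRF constant:

```go
package main

import (
	"fmt"
	"sort"
)

// rrfMerge fuses ranked lists of document IDs with Reciprocal Rank
// Fusion: score(d) = Σ over lists of 1/(k + rank(d)), ranks 1-based.
func rrfMerge(rankings [][]string, k float64) []string {
	scores := map[string]float64{}
	var order []string // preserves first-seen order for stable sorting
	for _, ranking := range rankings {
		for rank, id := range ranking {
			if _, seen := scores[id]; !seen {
				order = append(order, id)
			}
			scores[id] += 1.0 / (k + float64(rank+1))
		}
	}
	sort.SliceStable(order, func(i, j int) bool { return scores[order[i]] > scores[order[j]] })
	return order
}

func main() {
	dense := []string{"doc3", "doc1", "doc7"}  // vector-similarity order
	sparse := []string{"doc1", "doc9", "doc3"} // BM25 order
	fmt.Println(rrfMerge([][]string{dense, sparse}, 60))
	// → [doc1 doc3 doc9 doc7]
}
```

Documents that appear in both lists (doc1, doc3) accumulate score from each and rise above documents that only one retriever found.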

Reranking — retrieve top-20 by embedding similarity, then apply a cross-encoder reranker model to rerank those 20 by relevance to the query. Cross-encoders are more accurate than bi-encoders (which is what embedding models are) because they process the query and document together rather than independently. Cost: 20 cross-encoder inferences per query. Worth it for queries where the top-3 chunks matter most.

Retrieval method      Recall@5   Precision@5   Latency
Dense only (top-5)    0.61       0.54          45ms
Hybrid (dense+BM25)   0.74       0.66          52ms
Hybrid + reranker     0.81       0.77          180ms

The hybrid + reranker pipeline added 135ms of latency over dense-only retrieval for a 23 percentage point improvement in precision@5. For a document Q&A system where answer quality matters more than latency, the trade-off was correct.

Failure Mode 3: Evaluation

“It seems to work” is not evaluation. Without a systematic evaluation suite, you can’t measure whether changes improve or regress answer quality, and you can’t detect the (frequent) cases where changes that improve one class of questions hurt another class.

A minimal evaluation setup:

  1. A ground truth set — 100–200 question/answer pairs where you know the correct answer and which document it comes from. Human-curated, not LLM-generated (LLM-generated evals have systematic gaps matching LLM failure modes).
  2. Retrieval metrics — for each question, did the correct document appear in top-k? (recall@k, precision@k)
  3. Answer quality metrics — for questions where retrieval succeeded, is the LLM's answer correct? Measuring this is hard; pragmatic options:
    • Exact match / substring match for factual questions
    • LLM-as-judge (ask a separate, stronger LLM to score answer vs reference) — imperfect but scalable
    • Human review for a sample (expensive but ground truth)
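The retrieval half of the suite reduces to simple membership checks. A sketch of recall@k, assuming per-question retrieval results and a ground truth map from question to the ID of the document containing the answer:

```go
package main

import "fmt"

// recallAtK computes the fraction of questions whose known correct
// document ID appears among the top-k retrieved IDs.
func recallAtK(retrieved map[string][]string, truth map[string]string, k int) float64 {
	if len(truth) == 0 {
		return 0
	}
	hits := 0
	for question, want := range truth {
		ids := retrieved[question]
		if len(ids) > k {
			ids = ids[:k]
		}
		for _, id := range ids {
			if id == want {
				hits++
				break
			}
		}
	}
	return float64(hits) / float64(len(truth))
}

func main() {
	truth := map[string]string{"q1": "docA", "q2": "docB"}
	retrieved := map[string][]string{
		"q1": {"docA", "docC"},
		"q2": {"docC", "docD", "docB"}, // docB ranked 3rd: a miss at k=2
	}
	fmt.Println(recallAtK(retrieved, truth, 2)) // → 0.5
}
```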

We ran the evaluation suite on every change to the pipeline (chunking strategy, embedding model, retrieval parameters, reranker, prompt). It ran in ~3 minutes and produced a quality score that became the gate for merging changes.

Without the evaluation suite, every pipeline “improvement” was a guess.

Failure Mode 4: Context Stuffing and LLM Confusion

More retrieved context is not always better. LLMs have a known weakness: when given a long context containing both relevant and irrelevant information, they sometimes attend to the irrelevant parts and produce a wrong or hallucinated answer.

The “lost in the middle” problem (Liu et al., 2023): LLMs are better at using context at the beginning and end of the prompt than in the middle. Retrieving 10 chunks and concatenating them means chunks 4–7 are in the danger zone.

What helped:

  • Retrieve fewer, better chunks — 3–5 high-quality chunks beat 10 medium-quality ones
  • Rerank before truncating — use reranking to select the best 3 from 20 candidates rather than naively taking the top 3 from vector search
  • Position the most relevant chunk first — if you can identify the single most relevant chunk (reranker score), put it first in the prompt

The Production Architecture (After All That)

User query
    │
    ├──▶ Query expansion (LLM: generate 3 reformulations)
    │           │
    │    Multiple query variants
    │           │
    ▼           ▼
  Embedding   BM25 index
  (dense)     (sparse)
    │               │
    └───── RRF ─────┘
               │
           top-20 candidates
               │
           Cross-encoder reranker
               │
           top-4 chunks
               │
    ┌──────────▼──────────────────┐
    │ System: You are...           │
    │                              │
    │ Context:                     │
    │   [chunk 1 - most relevant]  │
    │   [chunk 2]                  │
    │   [chunk 3]                  │
    │   [chunk 4]                  │
    │                              │
    │ Question: {query}            │
    │ Answer only from context.    │
    │ If unsure, say so.           │
    └──────────────────────────────┘
               │
            LLM response
               │
    Evaluation harness (async, logs to eval DB)

Query expansion — generating 3 reformulations of the user’s query with an LLM and running all 4 through retrieval, then merging results — improved recall@5 from 0.81 to 0.88 at the cost of ~150ms additional latency (parallel LLM call). For our use case: worth it.
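The fan-out is straightforward in Go: run each variant through retrieval concurrently, then RRF-merge the ranked lists. In this sketch, retrieve stands in for the hybrid pipeline, and the variants slice would contain the original query plus the LLM-generated reformulations:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// retrieveAll runs every query variant through retrieve concurrently,
// then merges the ranked lists with Reciprocal Rank Fusion (k = 60).
func retrieveAll(variants []string, retrieve func(string) []string) []string {
	results := make([][]string, len(variants))
	var wg sync.WaitGroup
	for i, q := range variants {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			results[i] = retrieve(q)
		}(i, q)
	}
	wg.Wait()

	scores := map[string]float64{}
	var order []string
	for _, ranking := range results {
		for rank, id := range ranking {
			if _, seen := scores[id]; !seen {
				order = append(order, id)
			}
			scores[id] += 1.0 / (60.0 + float64(rank+1))
		}
	}
	sort.SliceStable(order, func(i, j int) bool { return scores[order[i]] > scores[order[j]] })
	return order
}

func main() {
	// Fake retriever for demonstration; real code calls the hybrid pipeline.
	fake := func(q string) []string {
		if q == "cancel trade" {
			return []string{"doc2", "doc5"}
		}
		return []string{"doc5", "doc9"}
	}
	fmt.Println(retrieveAll([]string{"cancel trade", "how to cancel a trade"}, fake))
	// → [doc5 doc2 doc9]
}
```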

The evaluation harness runs asynchronously and logs every query, retrieved chunks, and LLM response to a database. This powers the evaluation suite and lets us systematically detect quality regressions in production before users notice.

Go Implementation Notes

We built this in Go. The ecosystem is thinner than Python’s, but adequate:

  • Embeddings: direct HTTP calls to the embedding API (OpenAI, Cohere, or self-hosted) — no library needed, just a struct and json.Marshal
  • Vector DB: pgvector Postgres extension via pgx/v5 — fewer operational dependencies than a dedicated vector DB, fine at our scale (~5M chunks)
  • BM25: blevesearch/bleve for the sparse index, or tantivy via CGo if you need more performance
  • Reranker: HTTP call to a self-hosted Cohere reranker or a custom cross-encoder served with onnxruntime
  • LLM calls: direct HTTP to OpenAI/Anthropic/Bedrock APIs — the anthropic-sdk-go and openai-go packages are fine

The Python ecosystem has more pre-built components (LangChain, LlamaIndex), but they’re often opaque about what they’re doing, which makes debugging retrieval quality harder. When something goes wrong, you want to see the exact chunks being passed to the LLM, the exact scores from each retrieval stage, and the exact prompt. Explicit Go code makes this visible; framework abstractions hide it.