Every AI engineer has built a RAG demo that impresses in a notebook. Fifteen documents, clean text, simple questions — and it answers perfectly. Then you hit production: 50,000 documents, scanned PDFs, contradicting information, users asking questions the system was never designed for.
The retrieval step is responsible for roughly 80% of RAG failures in production. The generator is rarely the weak link — the bottleneck is almost always finding the right context in the first place.
This post is about that gap. What breaks, why it breaks, and how to fix it before it breaks you.
The Chunking Problem
Most tutorials suggest splitting documents every 500 tokens with a 50-token overlap. This works fine on Wikipedia articles. It fails on financial reports, legal contracts, and technical manuals — documents where meaning crosses page boundaries and context is everything.
The right chunking strategy depends entirely on your document type.
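To make the trade-off concrete, here is a minimal sketch of the tutorial-style fixed window next to one structure-aware alternative that never cuts a paragraph in half. The function names are made up for illustration, and whitespace-split words stand in for a real tokenizer, so treat the numbers as approximate.

```python
import re
from typing import List


def fixed_size_chunks(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def paragraph_chunks(text: str, max_tokens: int = 500) -> List[str]:
    """Pack whole paragraphs into chunks of at most `max_tokens` words.

    A single paragraph longer than `max_tokens` is kept intact as its own
    oversized chunk rather than being split mid-thought.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # assumption: word count approximates token count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The same packing idea extends to other boundaries (contract clauses, manual sections, table rows) by swapping the paragraph regex for a splitter that understands your document type.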
Embedding Model Selection
OpenAI's text-embedding-ada-002 is not always the best choice. It's convenient and decent across the board, but domain-specific models often outperform it significantly.
For legal documents, legal-specific models fine-tuned on case law outperform general embeddings by 15–30% on retrieval accuracy. For code, models trained on code repositories produce dramatically better semantic matches. The evaluation process matters here: build a test set of 100–200 question-answer pairs from your actual documents, and benchmark before committing to an embedding model. Switching later is expensive.
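One way to run that benchmark is a recall@k check, sketched below. It assumes each test question has been labeled with the ID of the chunk that contains its answer, and that every candidate model is wrapped in an `embed` callable that turns text into a list of floats. Both are assumptions of this sketch, not the API of any particular library.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    chunks: Dict[str, str],
    test_set: List[Tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose labeled chunk appears in the top-k results.

    `chunks` maps chunk ID to chunk text; `test_set` holds
    (question, ID of the chunk containing the answer) pairs.
    """
    if not test_set:
        raise ValueError("test_set is empty")
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for question, gold_id in test_set:
        q = embed(question)
        ranked = sorted(chunk_vecs, key=lambda cid: cosine(q, chunk_vecs[cid]), reverse=True)
        if gold_id in ranked[:k]:
            hits += 1
    return hits / len(test_set)
```

Run it once per candidate model on the same chunks and test set, and compare the scores. For a corpus of 50,000 documents you would replace the brute-force sort with your vector store's search, but the metric stays the same.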
Retrieval Failures and How to Handle Them
The most common failure mode in production RAG isn't hallucination — it's retrieval failure. The right context simply doesn't get pulled, and the model either makes something up or says it doesn't know.
Neither dense nor sparse retrieval alone is sufficient. Combine both with a reranker, use query rewriting to expand user queries before retrieval, and set confidence thresholds — if retrieval scores are too low, route to a fallback rather than generating a bad answer.
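Here is a rough sketch of how those pieces can fit together. It merges the dense and sparse result lists with reciprocal rank fusion (one common fusion method, chosen here for illustration and not the only option), reranks the merged pool, and returns `None` when nothing clears the threshold so the caller can route to a fallback. The `rerank_score` callable, the 0.5 threshold, and the pool sizes are placeholders to tune on your own data. Query rewriting is left out to keep the example short.

```python
from typing import Callable, Dict, List, Optional, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Score each ID as the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def retrieve_or_fallback(
    dense_ranking: Sequence[str],
    sparse_ranking: Sequence[str],
    rerank_score: Callable[[str], float],
    min_score: float = 0.5,
    pool_size: int = 20,
    top_n: int = 3,
) -> Optional[List[str]]:
    """Fuse both rankings, rerank the fused pool, and keep only confident hits.

    Returns None when no candidate reaches `min_score`, which signals the
    caller to use a fallback instead of generating from weak context.
    """
    fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
    pool = sorted(fused, key=fused.get, reverse=True)[:pool_size]
    scored = sorted(((rerank_score(d), d) for d in pool), reverse=True)
    kept = [d for score, d in scored[:top_n] if score >= min_score]
    return kept or None
```

A caller would treat `None` as "say you don't know, or hand off", which is usually a better outcome than an answer built on the wrong chunks.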
The Evaluation Loop
The most important thing you can build in a production RAG system isn't the retrieval pipeline — it's the evaluation pipeline. Log every query, every retrieved chunk, every response. Review failures weekly. The system should get measurably better over time.
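The logging half of that loop can start very small. The sketch below appends one JSON line per interaction to any writable text stream; the field names and the helper names are illustrative choices, not a standard schema.

```python
import json
import time
import uuid
from typing import Any, Dict, Iterable, List, TextIO


def log_interaction(
    sink: TextIO,
    query: str,
    chunks: List[Dict[str, Any]],
    response: str,
) -> str:
    """Append one query/chunks/response record as a JSON line and return its ID."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),   # Unix timestamp, seconds
        "query": query,
        "chunks": chunks,    # e.g. [{"chunk_id": ..., "score": ...}]
        "response": response,
    }
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["id"]


def read_log(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse logged JSON lines back into records for review, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```

In practice `sink` would be a file opened in append mode or a wrapper around your logging backend. Keeping the retrieved chunk IDs and scores next to the response is what makes the weekly failure review possible.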
RAGAS, TruLens, and similar frameworks give you retrieval precision, answer relevance, and faithfulness scores automatically. Set baselines before launch and track them.
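Tracking against a baseline can be as simple as the check below. It assumes you have already exported the framework's scores into a plain dictionary of metric name to value; it does not call RAGAS or TruLens itself, and the metric keys and the 0.02 tolerance are placeholders.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the metrics that are missing or fell more than `tolerance` below baseline."""
    failed = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None or value < base - tolerance:
            failed.append(name)
    return failed


# Hypothetical numbers, for illustration only.
baseline = {"retrieval_precision": 0.80, "answer_relevance": 0.85, "faithfulness": 0.90}
this_week = {"retrieval_precision": 0.81, "answer_relevance": 0.80, "faithfulness": 0.89}
failed_metrics = regressions(baseline, this_week)
```

Wiring a check like this into CI or a weekly job turns "track them" into something that fails loudly when a change to chunking, embeddings, or prompts makes the system worse.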
The difference between a RAG system that earns trust and one that gets shut down after three months is usually whether the team built this feedback loop early — or treated evaluation as an afterthought.
