Saad Ullah Bilal — AI Systems Architect

Every AI engineer has built a RAG demo that impresses in a notebook. Fifteen documents, clean text, simple questions — and it answers perfectly. Then you hit production: 50,000 documents, scanned PDFs, contradicting information, users asking questions the system was never designed for.

The retrieval step is responsible for roughly 80% of RAG failures in production. The generator is rarely the weak link — the bottleneck is almost always finding the right context in the first place.

This post is about that gap. What breaks, why it breaks, and how to fix it before it breaks you.

The Chunking Problem

Most tutorials suggest splitting documents every 500 tokens with a 50-token overlap. This works fine on Wikipedia articles. It fails on financial reports, legal contracts, and technical manuals, where a clause or table often depends on a definition or header that a fixed-size split leaves in a different chunk.
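
For reference, the tutorial default described above can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: the function name `chunk_tokens` is made up for this post, and the token list is synthetic where a real pipeline would use the output of the embedding model's tokenizer.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the document.
        if start + size >= len(tokens):
            break
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Note that the split points depend only on token counts, so nothing stops a boundary from landing in the middle of a clause or table.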

The right chunking strategy depends entirely on your document type:

Narrative Text

Semantic chunking based on paragraph breaks and topic shifts for articles and reports.

Structured Documents

Section-aware chunking that preserves headers — critical for contracts and technical specs.
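
A rough sketch of section-aware chunking, assuming the document has already been converted to text with markdown-style `#` headings (real contracts and specs usually need a format-specific heading detector). The names `HEADING` and `chunk_by_section` are illustrative only.

```python
import re

# Assumption: headings look like "# Title", "## Subsection", and so on.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_section(text):
    """Return one chunk per section, each carrying its full heading path."""
    path = []    # current heading trail, e.g. ["Master Agreement", "Termination"]
    body = []    # lines collected for the section being read
    chunks = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"headers": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the previous section first
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = (
    "# Master Agreement\nIntro text.\n"
    "## Termination\nEither party may terminate.\n"
    "## Liability\nLiability is capped."
)
for chunk in chunk_by_section(doc):
    print(chunk["headers"], "|", chunk["text"])
```

Embedding the heading path together with the text lets a chunk like "Either party may terminate." stay tied to the section it came from.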

Tabular Data

Row-level chunking with column context prepended so each chunk is self-contained.
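
As an illustration of prepending column context, the sketch below turns each CSV row into a self-contained chunk. The table name, column names, and values are invented sample data, and `rows_to_chunks` is a hypothetical helper.

```python
import csv
import io


def rows_to_chunks(csv_text, table_name):
    """Turn each row into one chunk that repeats the table and column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        fields = "; ".join(f"{col}: {val}" for col, val in row.items())
        chunks.append(f"Table: {table_name} | {fields}")
    return chunks


# Invented sample data for illustration only.
sample = "quarter,revenue_usd_m,margin_pct\nQ1 2024,12.4,31\nQ2 2024,13.1,33\n"
for chunk in rows_to_chunks(sample, "revenue_by_quarter"):
    print(chunk)
```

Without the column names, a row such as "Q2 2024,13.1,33" carries no meaning once it is separated from its header.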

Mixed Documents

Hybrid approaches that detect document structure before deciding how to split.

Embedding Model Selection

OpenAI's text-embedding-ada-002 is not always the best choice. It's convenient and decent across the board, but domain-specific models often outperform it significantly.

For legal documents, legal-specific models fine-tuned on case law outperform general embeddings by 15–30% on retrieval accuracy. For code, models trained on code repositories produce dramatically better semantic matches. The evaluation process matters here: build a test set of 100–200 question-answer pairs from your actual documents, and benchmark before committing to an embedding model. Switching later is expensive.
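
One way to run that benchmark is a hit rate at k: the fraction of test questions whose known-correct chunk shows up in the top k results. The sketch below is a simplified illustration. `toy_embed` is a bag-of-words placeholder rather than a real embedding model, and the chunks and questions are invented. In practice you would pass each candidate model's encode function as `embed` and compare the resulting scores.

```python
import math
from collections import Counter


def toy_embed(text):
    """Placeholder embedding (word counts). Swap in a real model here."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hit_rate_at_k(embed, chunks, test_set, k=3):
    """Fraction of questions whose gold chunk id appears in the top-k results."""
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for question, gold_id in test_set:
        q = embed(question)
        ranked = sorted(
            chunk_vecs, key=lambda cid: cosine(q, chunk_vecs[cid]), reverse=True
        )
        hits += gold_id in ranked[:k]
    return hits / len(test_set)


# Invented miniature test set; a real one would hold 100-200 pairs.
chunks = {
    "c1": "the lease term is five years",
    "c2": "termination requires ninety days notice",
    "c3": "rent is due on the first of each month",
}
test_set = [
    ("how long is the lease term", "c1"),
    ("when is rent due", "c3"),
    ("what notice does termination require", "c2"),
]
print(hit_rate_at_k(toy_embed, chunks, test_set, k=1))
```

Because the harness only depends on the `embed` callable, the same test set can score every candidate model before any of them is wired into the pipeline.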

Retrieval Failures and How to Handle Them

The most common failure mode in production RAG isn't hallucination — it's retrieval failure. The right context simply doesn't get pulled, and the model either makes something up or says it doesn't know.

When Dense Retrieval Wins

Semantic similarity searches

Paraphrased or rephrased queries

Concept-level lookups

Multilingual question matching

When Sparse Retrieval Wins

Exact keyword or phrase matches

Named entity and ID lookups

Code snippet retrieval

Highly technical jargon

Neither dense nor sparse retrieval alone is sufficient. Combine both with a reranker, use query rewriting to expand user queries before retrieval, and set confidence thresholds — if retrieval scores are too low, route to a fallback rather than generating a bad answer.
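
A minimal sketch of that combination, under a few assumptions. The post does not prescribe a fusion method, so reciprocal rank fusion (RRF) is used here as one common choice. The reranker is represented by a plain `rerank_score` callable, the threshold of 0.5 is arbitrary, and query rewriting is left out. All function names are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids. k=60 is a commonly used default constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def select_context(dense, sparse, rerank_score, min_score=0.5, top_n=3):
    """Fuse both rankings, rerank the candidates, and return None to signal
    that the caller should route to a fallback instead of generating."""
    candidates = reciprocal_rank_fusion([dense, sparse])
    scored = sorted(((rerank_score(d), d) for d in candidates), reverse=True)
    kept = [doc_id for score, doc_id in scored[:top_n] if score >= min_score]
    return kept or None


# Invented rankings and reranker scores for illustration.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fake_reranker = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1}.get

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
print(select_context(dense_hits, sparse_hits, fake_reranker))
```

The threshold is applied to the reranker score rather than the fused score, since RRF values reflect only rank positions and say little about absolute relevance.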

The Evaluation Loop

The most important thing you can build in a production RAG system isn't the retrieval pipeline — it's the evaluation pipeline. Log every query, every retrieved chunk, every response. Review failures weekly. The system should get measurably better over time.
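
The logging half of this can start very small. The sketch below appends one JSON record per interaction to a JSON Lines file and pulls out weak retrievals for the weekly review. The record fields, the `score` key on each chunk, and both function names are assumptions made for illustration.

```python
import json
import time
import uuid


def log_interaction(path, query, retrieved, response):
    """Append one record per request: the query, retrieved chunks, and answer."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": retrieved,  # assumed shape: [{"chunk_id": ..., "score": ...}]
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def low_score_queries(path, threshold):
    """Return logged queries whose best retrieval score fell below `threshold`."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            best = max((c["score"] for c in rec["retrieved"]), default=0.0)
            if best < threshold:
                flagged.append(rec["query"])
    return flagged
```

A call such as `low_score_queries("rag_log.jsonl", 0.4)` (the file name and cutoff are hypothetical) then gives a first list of queries to inspect, including those where nothing was retrieved at all.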

RAGAS, TruLens, and similar frameworks give you retrieval precision, answer relevance, and faithfulness scores automatically. Set baselines before launch and track them.
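
Tracking against a baseline can be as simple as a comparison that runs after every evaluation pass. The sketch below is framework-agnostic and does not use the RAGAS or TruLens APIs. The metric names mirror the ones above, while the scores, the tolerance, and `find_regressions` itself are hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that are missing or dropped more than `tolerance`."""
    dropped = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or now < base - tolerance:
            dropped[name] = (base, now)
    return dropped


# Hypothetical scores; real values would come from your evaluation framework.
baseline = {"retrieval_precision": 0.82, "answer_relevance": 0.88, "faithfulness": 0.91}
current = {"retrieval_precision": 0.74, "answer_relevance": 0.89, "faithfulness": 0.90}

print(find_regressions(baseline, current))
```

Wiring a check like this into CI or a scheduled job turns the baseline into something that can block a bad change before it ships.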

The difference between a RAG system that earns trust and one that gets shut down after three months is usually whether the team built this feedback loop early — or treated evaluation as an afterthought.