Retrieval-Augmented Generation
Ground the model in your data instead of hoping it remembers
RAG is the dominant pattern for LLM applications that need to answer questions about specific data — your documents, your database, your codebase. The shape is simple: at query time, retrieve the relevant content, paste it into the prompt, and let the model answer with citations.
Why Not Just Fine-Tune
Fine-tuning bakes knowledge into weights, but:
- Updating the data means re-training.
- The model can still hallucinate facts adjacent to what it learned.
- You lose citations — there's no link between the answer and a source.
RAG keeps the data outside the model. New documents are immediately available; old documents can be removed; every claim can be traced back to a source.
The Pipeline
A baseline RAG pipeline has three stages:
- Indexing — chunk documents, embed them, store in a vector database.
- Retrieval — embed the user query, fetch the top-K most similar chunks.
- Generation — paste the chunks into a prompt and ask the model to answer using only them.
This works. It's also where most teams stop, and it's why RAG often disappoints in production.
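The three stages can be sketched in a few lines. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, a plain Python list stands in for the vector database, and all function names are hypothetical. The generation step stops at building the prompt, since the model call depends on your provider.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing: embed every chunk once and keep the vector next to the text.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Retrieval: embed the query and return the k most similar chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    # Generation: number the chunks so the model can cite them.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping to Canada takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
top = retrieve(index, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", top)
```

Here `top` holds the refund chunk, because it is the only chunk that shares a term with the query. A real embedding model would match on meaning rather than exact tokens, which is the point of the retrieval stage.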
Where Naive RAG Breaks
- Bad chunking. Fixed-size chunks split mid-sentence, mid-idea, mid-table.
- Vector-only retrieval. Pure semantic search misses exact-match queries (product codes, names, error messages).
- No reranking. The top-K from a vector index is only an approximate relevance ranking; without reranking, the most relevant chunk can sit far down the list instead of at #1.
- No query rewriting. Users ask questions; documents are written as statements. The semantic gap matters.
- No grounding enforcement. Unless you require citations and check them against the retrieved chunks, the model can still make up claims and attribute them to sources.
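To make the first failure concrete, the sketch below compares a fixed-size splitter with one that respects paragraph boundaries. Both functions are hypothetical helpers written for this illustration, and the assumption that paragraphs are separated by blank lines will not hold for every document format.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Cut every `size` characters, ignoring sentence and paragraph boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    # Split on blank lines, then pack whole paragraphs into chunks of up to
    # max_chars. A single paragraph longer than max_chars is kept whole.
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


doc = "Returns are free.\n\nRefunds take 14 days.\n\nContact support by email."
fixed = fixed_size_chunks(doc, 25)
by_paragraph = paragraph_chunks(doc, 45)
```

With these inputs the first fixed-size chunk ends in the fragment "Refund", so the sentence about refunds is split across two chunks and neither half embeds well. The paragraph-based splitter keeps every sentence intact.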
A Better Default Pipeline
- Semantic chunking — split on document structure (headings, paragraphs), not byte counts.
- Hybrid retrieval — vector search + BM25 keyword search, combined.
- Reranking — use a cross-encoder or LLM to re-score the top-N candidates.
- Query rewriting — let the model rephrase the question before retrieval.
- Cited generation — require the model to attach source IDs to claims; reject responses without them.
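The text above does not say how the vector and keyword results should be combined. One common choice, used here as an assumption rather than a recommendation from this guide, is reciprocal rank fusion (RRF): each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed. The chunk IDs and the function name are hypothetical; k = 60 is the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]   # assumed output of a vector search
keyword_hits = ["c1", "c9", "c3"]  # assumed output of a BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales. Chunks that both retrievers return rise to the top of `fused`.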
Each addition costs latency and complexity; add one only when your evaluation shows the quality gain is worth it.
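Cited generation only helps if the citations are checked. The sketch below is one way to enforce it, assuming the prompt tells the model to put source IDs in square brackets before the closing punctuation of each sentence, as in "Refunds take 14 days [c1]." The bracket format, the sentence splitter, and the function name are all assumptions made for illustration.

```python
import re

CITATION = re.compile(r"\[(\w+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the answer passes."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return ["empty answer"]
    problems = []
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited:
            problems.append(f"uncited: {sentence}")
        elif cited - retrieved_ids:
            # The model cited an ID that was never in the prompt.
            problems.append(f"unknown source: {sentence}")
    return problems


retrieved = {"c1", "c2"}
good = "Refunds take 14 days [c1]. Shipping takes a week [c2]."
bad = "Refunds take 14 days [c1]. Shipping is free. Support is 24/7 [c9]."
good_problems = check_citations(good, retrieved)
bad_problems = check_citations(bad, retrieved)
```

A non-empty result is the signal to reject the response or retry. This check confirms that each cited ID exists, not that the chunk supports the claim; that second question belongs to the faithfulness evaluation.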
Evaluation
RAG quality has two halves:
- Retrieval — did we find the right documents? Measure with recall@K against a labeled set.
- Generation — given the right documents, did the answer use them faithfully? LLM-as-judge with a faithfulness rubric works well.
Many RAG bugs are retrieval bugs masquerading as generation bugs. Evaluate the two halves separately.
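For the retrieval half, recall@K is the fraction of labeled relevant documents that show up in the top K results. A minimal sketch, assuming the labeled set is a mapping from query ID to the set of relevant document IDs; the data and names below are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: dict[str, list[str]], labels: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical labeled set and retriever output.
labels = {"q1": {"d1", "d4"}, "q2": {"d2"}}
runs = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d6"]}

recall_2 = mean_recall_at_k(runs, labels, k=2)
recall_3 = mean_recall_at_k(runs, labels, k=3)
```

On this data, recall@2 is 0.75 and recall@3 is 1.0: the second relevant document for q1 only appears at rank 3. If recall is low at the K you actually pass to the model, no amount of prompt work on the generation side will fix the answers.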
When Not to RAG
- The corpus is small enough to fit in context. Just paste it all.
- The data is highly structured. Use SQL or a knowledge graph; the model can write the query directly instead of searching over text chunks.
- The user's question doesn't need your data. Don't pay the retrieval tax for general questions.