Retrieval-Augmented Generation

RAG is the dominant pattern for LLM applications that need to answer questions about specific data — your documents, your database, your codebase. The shape is simple: at query time, retrieve the relevant content, paste it into the prompt, and let the model answer with citations.

Why Not Just Fine-Tune

Fine-tuning bakes knowledge into weights, but:

Updating the data means re-training.
The model can still hallucinate facts adjacent to what it learned.
You lose citations — there's no link between the answer and a source.

RAG keeps the data outside the model. New documents are immediately available; old documents can be removed; every claim can be traced back to a source.

The Pipeline

A baseline RAG pipeline has three stages:

Indexing — chunk documents, embed them, store in a vector database.
Retrieval — embed the user query, fetch the top-K most similar chunks.
Generation — paste the chunks into a prompt and ask the model to answer using only them.
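
The three stages above can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a Python list, and all function names here are hypothetical. The final model call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing: embed every chunk once and keep (text, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Retrieval: embed the query and return the k most similar chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Generation: number the chunks so the model can cite them by ID.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Shipping to Canada takes 5 to 7 business days.",
]
index = build_index(chunks)
question = "How long do refunds take?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

Swapping `embed` for a real embedding model and the list for a vector store changes the quality of retrieval, but not the shape of the pipeline.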

The baseline pipeline works well enough for a demo. It's also where most teams stop, and it's why RAG often disappoints in production.

Where Naive RAG Breaks

Bad chunking. Fixed-size chunks split mid-sentence, mid-idea, mid-table.
Vector-only retrieval. Pure semantic search misses exact-match queries (product codes, names, error messages).
No reranking. The top-K from a vector index is approximate; without reranking, the truly relevant chunk often sits several positions down the list, not at #1.
No query rewriting. Users ask questions; documents are written as statements. The semantic gap matters.
No grounding enforcement. Unless you require citations to retrieved chunks and verify them, the model can still cite sources it made up.
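
The first failure in the list above is easy to reproduce. The sketch below contrasts a fixed-size splitter with one that splits on blank lines; both function names are hypothetical, and paragraph splitting is only the simplest form of structure-aware chunking.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Naive: cut every `size` characters, regardless of content.
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_chunks(text: str) -> list[str]:
    # Structure-aware: split on blank lines so each chunk is a whole paragraph.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "Refunds are issued within 14 days.\n\n"
    "Shipping to Canada takes 5 to 7 business days."
)

naive = fixed_size_chunks(doc, 20)
structured = paragraph_chunks(doc)

print(naive[0])       # Refunds are issued w
print(structured[0])  # Refunds are issued within 14 days.
```

The first naive chunk ends mid-word, so neither it nor its neighbor carries the full fact about the refund window. The paragraph splitter keeps each statement intact.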

A Better Default Pipeline

Semantic chunking — split on document structure (headings, paragraphs), not byte counts.
Hybrid retrieval — vector search + BM25 keyword search, combined.
Reranking — use a cross-encoder or LLM to re-score the top-N candidates.
Query rewriting — let the model rephrase the question before retrieval.
Cited generation — require the model to attach source IDs to claims; reject responses without them.
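
The list above says the vector and keyword results are "combined" without naming a method. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the only option. It needs only the rank of each chunk in each list, so the two retrievers' scores never have to be put on the same scale. The chunk IDs and both hit lists are made up for illustration, and `k = 60` is the conventional default for the smoothing constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Hypothetical outputs of the two retrievers for the same query, best first.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear near the top of both lists (`c1`, `c3`) outrank chunks that only one retriever found. The fused list is also a natural input to the reranking step.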

Each addition costs latency and complexity; add one only when your evaluation shows it measurably improves quality.
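
Cited generation, the last step in the list above, is straightforward to enforce mechanically. The sketch below assumes the model was asked to cite sources as bracketed numbers like `[1]`; that format and the function name `check_citations` are assumptions for illustration.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def check_citations(answer: str, retrieved_ids: set[int]) -> tuple[bool, str]:
    """Return (accepted, reason) for an answer that should cite chunks as [n]."""
    cited = {int(n) for n in CITATION.findall(answer)}
    if not cited:
        # Reject answers that make claims without pointing at any source.
        return False, "no citations"
    unknown = cited - retrieved_ids
    if unknown:
        # Reject citations to IDs that were never in the prompt.
        return False, f"cites unknown sources: {sorted(unknown)}"
    return True, "ok"

retrieved = {1, 2}
good = check_citations("Refunds take 14 days [1].", retrieved)
bad = check_citations("Refunds take 30 days [5].", retrieved)
missing = check_citations("Refunds take 14 days.", retrieved)
```

This only confirms that citations exist and point at chunks that were actually retrieved. Whether the cited chunk supports the claim is a separate question, and one for the faithfulness evaluation.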

Evaluation

RAG quality has two halves:

Retrieval — did we find the right documents? Measure with recall@K against a labeled set.
Generation — given the right documents, did the answer use them faithfully? LLM-as-judge with a faithfulness rubric works well.
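
The retrieval half is simple to compute once a labeled set exists. A minimal sketch, assuming each query is labeled with the IDs of the chunks a human judged relevant; the queries, IDs, and retriever output below are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)

# Hypothetical labeled set: query -> IDs of chunks marked relevant.
labeled = {
    "refund window": {"c1", "c4"},
    "shipping to canada": {"c3"},
}
# Hypothetical retriever output for the same queries, best match first.
results = {
    "refund window": ["c1", "c7", "c4", "c2"],
    "shipping to canada": ["c9", "c8", "c3", "c1"],
}

k = 2
per_query = {q: recall_at_k(results[q], labeled[q], k) for q in labeled}
mean_recall = sum(per_query.values()) / len(per_query)

print(per_query)    # {'refund window': 0.5, 'shipping to canada': 0.0}
print(mean_recall)  # 0.25
```

In this made-up data every relevant chunk shows up by position 3, so recall@3 would be 1.0 while recall@2 is 0.25. A gap like that suggests the retriever is finding the right chunks but ranking them too low.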

Many RAG bugs are retrieval bugs masquerading as generation bugs. Evaluate the two halves separately.

When Not to RAG

The corpus is small enough to fit in context. Just paste it all.
The data is highly structured. Use SQL or a knowledge graph; the model can write queries against it directly instead of retrieving text chunks.
The user's question doesn't need your data. Don't pay the retrieval tax for general questions.