Steven's Knowledge

Retrieval-Augmented Generation

Ground the model in your data instead of hoping it remembers

RAG is the dominant pattern for LLM applications that need to answer questions about specific data — your documents, your database, your codebase. The shape is simple: at query time, retrieve the relevant content, paste it into the prompt, and let the model answer with citations.

Why Not Just Fine-Tune

Fine-tuning bakes knowledge into weights, but:

  • Updating the data means re-training.
  • The model can still hallucinate facts adjacent to what it learned.
  • You lose citations — there's no link between the answer and a source.

RAG keeps the data outside the model. New documents are immediately available; old documents can be removed; every claim can be traced back to a source.

The Pipeline

A baseline RAG pipeline has three stages:

  1. Indexing — chunk documents, embed them, store in a vector database.
  2. Retrieval — embed the user query, fetch the top-K most similar chunks.
  3. Generation — paste the chunks into a prompt and ask the model to answer using only them.
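
The three stages above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, a plain dict plays the role of the vector database, and the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical. The model call itself is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: dict[str, str]) -> dict[str, Counter]:
    """Indexing: embed every chunk once and store the vector under its chunk ID."""
    return {chunk_id: embed(text) for chunk_id, text in chunks.items()}


def retrieve(query: str, index: dict[str, Counter], k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the IDs of the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda cid: cosine(q, index[cid]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunk_ids: list[str], chunks: dict[str, str]) -> str:
    """Generation input: paste the retrieved chunks, labeled with IDs, into the prompt."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the sources below and cite their IDs.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# Hypothetical corpus, already chunked.
chunks = {
    "doc1": "Refunds are issued within 14 days of purchase.",
    "doc2": "Shipping to Europe takes five business days.",
    "doc3": "Refunds for digital goods require a support ticket.",
}
question = "How do refunds work for digital goods?"

index = build_index(chunks)
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top, chunks)
```

Swapping `embed` for a real embedding model and the dict for a vector store changes the quality of retrieval but not the shape of the pipeline.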

This works. It's also where most teams stop, and it's why RAG often disappoints in production.

Where Naive RAG Breaks

  • Bad chunking. Fixed-size chunks split mid-sentence, mid-idea, mid-table.
  • Vector-only retrieval. Pure semantic search misses exact-match queries (product codes, names, error messages).
  • No reranking. Top-K from a vector index is ordered by approximate embedding similarity, not by relevance to the question; without reranking, the truly relevant chunk can sit at #8 rather than #1.
  • No query rewriting. Users ask questions; documents are written as statements. The semantic gap matters.
  • No grounding enforcement. Unless citations are required to point at retrieved chunks, the model can still cite sources it made up.
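
The chunking failure in the first item is easy to reproduce. The sketch below compares a fixed-size splitter with one that splits on blank lines; both function names and the sample text are made up for illustration, and blank-line splitting is only the simplest form of structure-aware chunking.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence and paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each chunk is one whole paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Hypothetical two-paragraph document.
doc = (
    "Error E1042 means the license key has expired.\n\n"
    "To renew, open Settings and choose Billing."
)

fixed = fixed_size_chunks(doc, 30)   # first chunk stops partway through sentence one
paras = paragraph_chunks(doc)        # one chunk per paragraph, sentences intact
```

With the fixed-size splitter, the error code and its explanation land in different chunks, so a query about the error may retrieve only half of the answer.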

A Better Default Pipeline

  • Semantic chunking — split on document structure (headings, paragraphs), not byte counts.
  • Hybrid retrieval — vector search + BM25 keyword search, combined.
  • Reranking — use a cross-encoder or LLM to re-score the top-N candidates.
  • Query rewriting — let the model rephrase the question before retrieval.
  • Cited generation — require the model to attach source IDs to claims; reject responses without them.
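
The hybrid retrieval item says the two result lists are "combined" without naming a method. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the approach the text has in mind. It needs only the rank positions from each retriever, so vector similarities and BM25 scores never have to be put on the same scale. The constant `k = 60` is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each appearance adds 1 / (k + rank) to the score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that both retrievers return ("a" and "c" here) rise above documents that only one of them found.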

Each addition costs latency and complexity; add them when measurable quality demands it.
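
Of the additions listed above, cited generation is the cheapest to enforce in code. The sketch below assumes the prompt asked the model to cite sources as bracketed chunk IDs such as `[doc1]`; that format and the function names are assumptions for illustration. It rejects answers with no citations or with citations to chunks that were never retrieved. It does not check that the cited chunk actually supports the claim; that is the job of a faithfulness evaluation.

```python
import re


def cited_ids(answer: str) -> set[str]:
    """Extract bracketed source IDs such as [doc1] from the model's answer."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))


def is_grounded(answer: str, retrieved_ids: set[str]) -> bool:
    """Accept only answers that cite something, and cite only retrieved chunks."""
    cited = cited_ids(answer)
    return bool(cited) and cited <= retrieved_ids


retrieved = {"doc1", "doc3"}  # hypothetical IDs of the chunks placed in the prompt

ok = is_grounded("Refunds are issued within 14 days [doc1].", retrieved)
uncited = is_grounded("Refunds are issued within 14 days.", retrieved)
invented = is_grounded("Refunds take 30 days [doc9].", retrieved)
```

A rejected answer can be retried or replaced with a fallback message, depending on how much latency the application can afford.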

Evaluation

RAG quality has two halves:

  • Retrieval — did we find the right documents? Measure with recall@K against a labeled set.
  • Generation — given the right documents, did the answer use them faithfully? LLM-as-judge with a faithfulness rubric works well.

Many RAG bugs are retrieval bugs masquerading as generation bugs. Evaluate the two halves separately.
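
The retrieval half is simple to score once a labeled set exists. The sketch below computes recall@K per query and averages it; the function names and the two labeled examples are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(examples: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved IDs, relevant IDs) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in examples) / len(examples)


# Hypothetical labeled set: retriever output (best first) and the known-relevant IDs.
labeled = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # top-2 finds one of the two relevant docs
    (["d5", "d4"], {"d5"}),              # top-2 finds the only relevant doc
]

score = mean_recall_at_k(labeled, k=2)
```

If this number is low, no amount of prompt tuning on the generation side will fix the answers, because the right documents never reach the model.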

When Not to RAG

  • The corpus is small enough to fit in context. Just paste it all.
  • The data is highly structured. Use SQL or a knowledge graph; the model can query directly.
  • The user's question doesn't need your data. Don't pay the retrieval tax for general questions.
