Retrieval-Augmented Generation (RAG) Explained

A technical explanation of Retrieval-Augmented Generation — how it works, why it reduces hallucination, its architecture components, and when to use RAG versus fine-tuning.

What Is RAG

Retrieval-Augmented Generation (RAG) is an LLM architecture pattern that separates parametric knowledge (stored in model weights) from non-parametric knowledge (stored in an external document corpus). At inference time, relevant documents are retrieved and injected into the model’s context window before generation.

RAG was introduced by Lewis et al. (2020) at Facebook AI Research (now Meta AI) as a method to improve factual accuracy on knowledge-intensive tasks such as open-domain question answering, using a document index that can be updated without retraining the model.

Architecture

A standard RAG pipeline has three components:

1. Document Store: A collection of documents (PDFs, web pages, database records) is split into passages, typically 256–512 tokens each, and stored with their embeddings in a vector database or index (Pinecone, Weaviate, pgvector, FAISS).

2. Retriever: Given a user query, the retriever computes the query’s embedding and runs approximate nearest-neighbor (ANN) search against the document store to return the top-k most semantically similar passages. Dense retrieval (bi-encoder models such as text-embedding-3-large) outperforms sparse retrieval (BM25) on most semantic tasks, and the two can be combined in hybrid search.

3. Generator: The retrieved passages are concatenated with the user query and a system prompt and placed in the LLM’s context window. The model generates a response grounded in the retrieved content and can be prompted to cite specific passages.
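
The three components above can be sketched end to end in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: embed is a toy bag-of-words stand-in for a real dense embedding model, retrieve does an exhaustive cosine-similarity scan rather than ANN search in a vector database, and the function names (chunk, embed, retrieve, build_prompt) and the prompt wording are hypothetical. The final prompt would be sent to whichever LLM you use.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into passages of at most max_tokens whitespace-separated tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (exhaustive scan, not ANN)."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Concatenate numbered passages and the user query into one prompt string."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# 1. Document store: chunk each document and keep (passage, embedding) pairs.
documents = [
    "BM25 is a sparse retrieval function based on term frequency.",
    "Vector databases store embeddings for nearest-neighbor search.",
    "Fine-tuning updates model weights on domain data.",
]
store = [(p, embed(p)) for doc in documents for p in chunk(doc)]

# 2. Retriever: find the passages closest to the query.
query = "How do vector databases search embeddings?"
top = retrieve(query, store, k=2)

# 3. Generator input: the prompt that would be passed to the LLM.
prompt = build_prompt(query, top)
print(prompt)
```

Swapping embed for a real embedding model and store for a vector database changes the quality and scale of retrieval but not the shape of the pipeline.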

Retrieval Methods

Method             | Mechanism                                | Best For
Dense (bi-encoder) | Embedding similarity                     | Semantic questions
Sparse (BM25)      | Term frequency matching                  | Keyword-heavy queries
Hybrid             | Dense + sparse fusion                    | General purpose
Re-ranking         | Cross-encoder scoring of top-k           | High-precision tasks
HyDE               | Generate hypothetical doc, then retrieve | Ambiguous queries
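
The table does not say how dense and sparse results are fused in hybrid search. One common option is Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever, not their raw scores. The sketch below is an illustration under assumptions: the function name, the document IDs, and the two input rankings are made up, and k = 60 is a commonly used default rather than a value taken from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense retriever and a BM25 retriever for one query.
dense_ranking = ["doc_b", "doc_a", "doc_c"]
sparse_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly (doc_a, doc_b) rise to the top, while documents seen by only one retriever still appear lower in the list. A cross-encoder re-ranker, as in the Re-ranking row, could then rescore this fused top-k.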

RAG vs Fine-Tuning

Use RAG when:

  • Knowledge base updates frequently (daily/weekly)
  • Responses must be traceable to specific source documents
  • Domain corpus is large (millions of documents)
  • You need to deploy quickly without retraining

Use fine-tuning when:

  • You need to change the model’s reasoning style or output format
  • Domain-specific vocabulary or notation is critical
  • Latency requirements prohibit retrieval at inference time
  • The knowledge corpus is small and stable

Use both when you need domain-specific reasoning and up-to-date facts: a fine-tuned model specialized in your domain’s reasoning patterns, combined with RAG for current factual grounding, can outperform either approach alone on enterprise knowledge tasks.