# Retrieval-Augmented Generation (RAG) Confidence: high Last verified: 2026-05-22 Generation: human_only ## TL;DR Retrieval-Augmented Generation (RAG) is an AI architecture introduced by Lewis et al. (2020) at Facebook AI Research that combines large language models with real-time external knowledge retrieval. Instead of relying solely on parametric knowledge (what the model memorized during training), RAG retrieves relevant documents from a knowledge base and provides them as context for generation. This reduces hallucination by ~50% in empirical studies, enables responses grounded in up-to-date and domain-specific information, and provides source attribution. RAG underpins AI search engines (Perplexity, Google AI Overviews, ChatGPT Search), enterprise knowledge bases, and research assistants (Elicit, Consensus) — making it the dominant architecture for production AI systems requiring factual accuracy. ## Core Explanation Traditional LLMs have a fundamental limitation: their knowledge is frozen at training time. A model trained in early 2025 cannot answer questions about events after that date; a general-purpose model lacks domain-specific knowledge about a company's internal documentation. RAG solves both problems by **decoupling knowledge storage from language generation**: ### Two-Phase Architecture **Phase 1 — Indexing (offline)**: 1. Documents are split into overlapping chunks (typically 256-1024 tokens) 2. Each chunk is embedded into a dense vector (768-3072 dimensions) using an encoder model 3. Vectors are stored in a vector database (Pinecone, Weaviate, pgvector, Qdrant) 4. Often indexed with Approximate Nearest Neighbor (ANN) algorithms like HNSW **Phase 2 — Retrieval + Generation (online)**: 1. User query is embedded into the same vector space 2. Top-k most similar document chunks are retrieved (typically k=5–20) by cosine similarity or dot product 3. Retrieved context is concatenated with the query and fed to the LLM: `"Answer based on: [Document 1] ... [Document k]. Question: [user query]"` 4. LLM generates a response grounded in the provided context This architecture enables the knowledge base to be updated in real-time without any model retraining. New documents can be added, stale documents removed, and access can be controlled per user — all without touching the LLM. ## Detailed Analysis ### RAG vs. Standard LLM Approaches | Aspect | Standard LLM (Parametric Only) | RAG (Parametric + Retrieval) | |--------|:-----------------------------:|:----------------------------:| | Knowledge freshness | Frozen at training cutoff | Updated in real-time | | Hallucination rate | Higher on factual/specific queries | Reduced by ~50% (empirical, domain-dependent) | | Source attribution | None (model's "memory") | Retrieved documents are citable | | Domain adaptation | Requires fine-tuning | Swap or augment the knowledge base | | Inference cost | Lower (single forward pass) | Higher (retrieval adds 200ms–2s latency) | | Factual precision on niche topics | Lower | Higher when knowledge base covers the domain | | Deployment complexity | Simple | Moderate (maintain vector DB + embedding pipeline) | ### Chunking Strategies How documents are split significantly impacts retrieval quality: | Strategy | Method | Best For | |----------|--------|----------| | **Fixed-size** | 256-1024 token chunks with 10-20% overlap | General purpose, simple to implement | | **Semantic** | Split at natural boundaries (paragraphs, sections) | Narrative documents, articles | | **Recursive** | Hierarchical chunks (parent → child) at different granularities | Long documents requiring multi-level context | | **Sentence-based** | One or few sentences per chunk | Factoid QA, precise retrieval | | **Agentic** | LLM determines optimal split points | Complex, heterogeneous documents | A common pattern: use smaller chunks for precise retrieval (~256 tokens), but expand to surrounding context (parent document, adjacent chunks) for generation — this combines retrieval precision with generation context richness. ### Embedding Models The quality of the embedding model is one of the strongest determinants of RAG performance. Key models as of 2025-2026: | Model | Dimensions | Context Length | Developer | Open? | |-------|:---------:|:-------------:|-----------|:-----:| | text-embedding-3-large | 256-3072 (variable) | 8,191 | OpenAI | ❌ | | text-embedding-3-small | 512-1536 | 8,191 | OpenAI | ❌ | | Cohere Embed v3 | 1024 | 512 | Cohere | ❌ | | BGE-M3 | 1024 | 8,192 | BAAI | ✅ | | E5-mistral-7b-instruct | 4096 | 32,768 | Microsoft | ✅ | *The ability to tune embedding dimensions trades accuracy for storage/performance — text-embedding-3 at 256 dimensions retains ~98% of the MTEB score while using 1/12th the storage.* ### Retrieval Enhancement: Re-ranking After initial retrieval (fast but approximate), a cross-encoder model re-scores the top-k candidates (slow but precise). This two-stage pipeline is standard in production RAG systems: 1. **Stage 1**: Bi-encoder (embedding model) retrieves top-50 or top-100 candidates 2. **Stage 2**: Cross-encoder scores query-document pairs jointly, re-ranking to top-5 or top-10 Cross-encoders are significantly more accurate but also much slower (they process query + document together, rather than pre-computing document embeddings). The two-stage approach balances speed and quality. ### RAG Variants | Variant | Description | Use Case | |---------|------------|----------| | **Naive RAG** | Single retrieval → single generation, no enhancement | Simple Q&A over small document sets | | **Advanced RAG** | Query rewriting + re-ranking + context compression | Production enterprise systems | | **Agentic RAG** | Multi-step retrieval with tool use (search→filter→verify→generate) | Complex research, multi-hop questions | | **Graph RAG** | Retrieves from knowledge graphs (entities + relationships, not just vectors) | Entity-rich domains, relationship queries | | **Multimodal RAG** | Retrieves and reasons over images, tables, and text | Healthcare, legal, technical docs | ### Applications RAG is the underlying architecture for most production AI systems requiring factual accuracy: | Application | Examples | How RAG Is Used | |------------|----------|----------------| | AI Search Engines | Perplexity, Google AI Overviews, ChatGPT Search | Retrieve web pages, cite sources per claim | | Enterprise KB | Internal docs Q&A, customer support | Index company documentation, restrict by access | | Research | Elicit, Consensus, Scite | Retrieve academic papers, synthesize findings with citations | | Legal | Contract analysis, regulatory compliance | Index laws and precedents, ground analysis in documents | | Healthcare | Clinical decision support | Retrieve medical literature, ground recommendations in evidence | ## Further Reading - [RAG Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401): The foundational RAG paper (5K+ citations) - [DPR Paper (Karpukhin et al., 2020)](https://arxiv.org/abs/2004.04906): Dense passage retrieval - [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/): Practical implementation