Retrieval-Augmented Generation
Large language models have a dirty secret, and it is not the one the op-eds keep warning you about. The secret is this: they make things up. Confidently, fluently, and with impeccable grammar, they fabricate facts, invent citations, and hallucinate details that sound plausible but are entirely fictional. Ask a model about an obscure legal case, and it may generate a beautifully formatted citation to a case that has never existed. Ask it about your company's refund policy, and it will cheerfully produce one — it just may not be your company's refund policy.
This is not a bug that will be fixed in the next release. It is an architectural feature of how language models work. They predict the next most likely token based on patterns in their training data. They do not "know" things in any meaningful sense. They generate plausible text. Sometimes plausible and true overlap. Sometimes they do not.
Retrieval-Augmented Generation — RAG — is the engineering solution to this problem. Instead of asking the model to generate answers from its parametric memory (the weights learned during training), you first retrieve relevant documents from a knowledge base and then augment the model's prompt with those documents. The model generates its answer grounded in the retrieved context rather than fabricating from whole cloth.
RAG is not glamorous. It is plumbing. But it is the plumbing that makes AI-powered knowledge systems actually work in production, and understanding it deeply is essential for anyone building or evaluating these systems.
Why RAG Exists
RAG addresses three fundamental limitations of large language models:
The hallucination problem. As described above, models generate plausible text regardless of factual accuracy. By providing relevant source documents in the prompt, RAG constrains the model's output to information that actually exists in your knowledge base. The model can still hallucinate, but the probability decreases significantly when correct information is right there in the context.
The knowledge cutoff problem. Models are trained on data up to a specific date. They know nothing about events after their training cutoff. Your company launched a new product last month? The model has no idea. RAG solves this by retrieving current documents at query time, ensuring the model has access to up-to-date information regardless of when it was trained.
The proprietary knowledge problem. Models are trained on public data. They do not know your internal procedures, your customer data, your engineering documentation, or your HR policies. RAG lets you connect a model to your private knowledge base without fine-tuning, retraining, or sharing your data with the model provider.
There is a fourth, more practical reason RAG is popular: it is dramatically cheaper and faster to implement than fine-tuning. Fine-tuning a large model on your data requires significant compute resources, machine learning expertise, and ongoing maintenance. RAG requires a vector database and some glue code. For most enterprise use cases, RAG provides 80% of the benefit at 10% of the cost.
The RAG Architecture
At its core, RAG is a two-phase system: an offline indexing phase and an online query phase.
Offline Indexing Phase
Before you can retrieve anything, you need to index your documents. This involves:
- Document ingestion. Collect your source documents — PDFs, web pages, Markdown files, database records, Slack messages, whatever constitutes your knowledge base.
- Document chunking. Split documents into smaller pieces (chunks) that are appropriately sized for embedding and retrieval. This is more subtle than it sounds, and we will discuss strategies shortly.
- Embedding generation. Convert each chunk into a dense vector representation (an embedding) using an embedding model. These vectors capture the semantic meaning of the text.
- Vector storage. Store the embeddings (along with the original text and any metadata) in a vector database optimized for similarity search.
Online Query Phase
When a user asks a question:
- Query embedding. Convert the user's question into a vector using the same embedding model used during indexing.
- Retrieval. Search the vector database for chunks whose embeddings are most similar to the query embedding. Return the top-k most relevant chunks.
- Context assembly. Assemble the retrieved chunks into a prompt, typically with instructions telling the model to answer based on the provided context.
- Generation. Send the augmented prompt to the language model. The model generates an answer grounded in the retrieved context.
- Post-processing. Optionally, extract citations, check for hallucinations, format the response, or apply other quality controls.
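The two phases above can be sketched end to end in a few lines. This is a deliberately toy version: the bag-of-words `Counter` stands in for a real embedding model, and the "vector store" is a plain list, but the indexing, retrieval, and context-assembly steps are the same ones a production pipeline performs.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline indexing phase: chunk -> embedding, stored together
chunks = [
    "To configure the database, edit db.yaml and set the host field.",
    "Refunds are processed within five business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Online query phase: embed the question, retrieve, assemble the prompt
query = "How do I configure the database?"
q_vec = embed(query)
top_chunk, _ = max(index, key=lambda item: cosine(q_vec, item[1]))
prompt = f"Answer using only this context:\n\n{top_chunk}\n\nQuestion: {query}"
```

The final `prompt` is what gets sent to the language model in the generation step.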
That is the skeleton. Now let us put flesh on the bones.
Document Chunking Strategies
Chunking is where most RAG pipelines quietly succeed or fail. The goal is to create chunks that are large enough to contain meaningful, self-contained information but small enough to be relevant to specific queries. Get this wrong, and your retrieval will return chunks that are either too vague to be useful or too narrow to provide context.
Fixed-Size Chunking
The simplest approach: split text into chunks of a fixed number of tokens (or characters), with optional overlap between consecutive chunks.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(document_text)
The overlap is important. Without it, information that spans a chunk boundary gets split across two chunks, and neither chunk contains the complete thought. A 10-20% overlap is typical.
Fixed-size chunking is fast, predictable, and works reasonably well for homogeneous text. It works poorly for structured documents where the logical boundaries (section headers, paragraphs, code blocks) do not align with the fixed chunk size. You end up with chunks that start mid-sentence or split a code example in half.
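The window arithmetic behind fixed-size chunking with overlap is worth seeing once. A minimal sketch (character-based rather than token-based, for simplicity):

```python
def chunk_fixed(text, size, overlap):
    # Slide a window of `size` characters, advancing by size - overlap,
    # so each pair of consecutive chunks shares `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "the quick brown fox jumps over the lazy dog" * 3
chunks = chunk_fixed(text, size=40, overlap=8)
```

Every chunk's last 8 characters reappear as the next chunk's first 8, which is exactly what protects a thought that straddles the boundary.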
Semantic Chunking
Instead of splitting by character count, semantic chunking splits by meaning. The idea is to identify natural breakpoints in the text — paragraph boundaries, topic shifts, section headers — and split there.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_text(document_text)
Semantic chunking produces more coherent chunks, but it is slower (it requires embedding computation during chunking) and less predictable (chunk sizes vary). It also depends on the quality of the embedding model — a poor embedding model will identify poor breakpoints.
Recursive Character Splitting
A pragmatic middle ground: try to split on natural boundaries (double newlines, then single newlines, then spaces), falling back to character-level splitting only when necessary.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
This is the default choice for most production RAG systems, and for good reason. It respects document structure when possible while maintaining predictable chunk sizes. It is the chunking equivalent of sensible shoes — not exciting, but reliable.
Document-Aware Chunking
For structured documents (Markdown, HTML, code files), you can use format-aware splitters that understand the document structure:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "header_1"),
    ("##", "header_2"),
    ("###", "header_3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_text)
Each chunk retains the header hierarchy as metadata, which is invaluable for retrieval. When a user asks about "installation instructions," you want to retrieve the chunk under the "Installation" header, not a chunk that happens to mention the word "install" in a different context.
Choosing a Chunking Strategy
There is no universally optimal chunking strategy. The right choice depends on your documents and your queries. Some guidelines:
- Homogeneous, unstructured text (transcripts, articles): recursive character splitting with a 10-20% overlap.
- Structured documents (documentation, manuals): document-aware splitting that respects headers and sections.
- Code: language-aware splitting that respects function and class boundaries.
- Mixed content: use different strategies for different document types.
A chunk size of 500-1000 tokens works well for most use cases. Smaller chunks improve retrieval precision (you get exactly the relevant snippet) but lose context. Larger chunks preserve context but may include irrelevant information that distracts the model.
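For the code case above, "language-aware splitting" can be approximated surprisingly well with a regex over top-level definitions. This is a toy sketch for Python source only, not a real parser — production systems typically use a language-specific splitter or a tree-sitter grammar:

```python
import re

def chunk_python_source(source):
    # Split before each top-level `def` or `class` so a function never
    # straddles a chunk boundary. Indented (nested) definitions do not
    # match the start-of-line lookahead, so methods stay with their class.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", source)
    return [p for p in parts if p.strip()]

src = "import os\n\n\ndef a():\n    return 1\n\n\nclass B:\n    pass\n"
pieces = chunk_python_source(src)
```

Each resulting piece is a self-contained top-level unit, which is exactly the property fixed-size chunking destroys.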
Vector Stores
Once you have your chunks embedded, you need somewhere to store them and search them efficiently. This is the job of the vector store.
FAISS
Facebook AI Similarity Search is the granddaddy of vector search libraries. It is fast, memory-efficient, and battle-tested. It runs in-process (no separate server needed), which makes it excellent for prototyping and small-to-medium datasets.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(
    texts=chunks,
    embedding=embeddings,
    metadatas=metadata_list
)
# Save to disk
vectorstore.save_local("faiss_index")
# Search
results = vectorstore.similarity_search(
    "How do I configure the database?",
    k=5
)
FAISS limitations: no built-in persistence (you serialize to disk manually), no metadata filtering without additional infrastructure, and scaling beyond a single machine requires custom engineering.
ChromaDB
Chroma is an open-source embedding database designed specifically for AI applications. It runs as an embedded database or a client-server architecture, supports metadata filtering, and provides a clean API.
import chromadb
from chromadb.utils import embedding_functions
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=ef
)
collection.add(
    documents=chunks,
    metadatas=metadata_list,
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)
results = collection.query(
    query_texts=["How do I configure the database?"],
    n_results=5,
    where={"source": "admin_guide"}  # metadata filtering
)
Chroma is excellent for prototyping and medium-scale applications. Its metadata filtering is genuinely useful — being able to restrict retrieval to specific document sources, date ranges, or categories significantly improves relevance.
Qdrant
Qdrant is a production-grade vector database built in Rust. It supports filtering, payload storage, and horizontal scaling. If you are building a system that needs to handle millions of vectors with complex filtering requirements, Qdrant is a strong choice.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)
client = QdrantClient(host="localhost", port=6333)
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    )
)
# Upsert vectors
client.upsert(
    collection_name="knowledge_base",
    points=[
        PointStruct(
            id=i,
            vector=embedding,
            payload={"text": chunk, "source": source}
        )
        for i, (embedding, chunk, source)
        in enumerate(zip(embeddings, chunks, sources))
    ]
)
# Search with filtering
results = client.search(
    collection_name="knowledge_base",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="admin_guide"))]
    ),
    limit=5
)
pgvector
If you are already running PostgreSQL — and in 2026, who is not — pgvector adds vector similarity search directly to your existing database. No additional infrastructure, no new operational burden, and you get the full power of SQL for filtering and joins.
-- Enable the extension
CREATE EXTENSION vector;
-- Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    source VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW(),
    embedding vector(1536)
);
-- Create an index for fast similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Search (query_embedding stands in for a parameter
-- bound by the application)
SELECT content, source,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE source = 'admin_guide'
ORDER BY embedding <=> query_embedding
LIMIT 5;
pgvector is not as fast as purpose-built vector databases for large-scale workloads, but for most applications, the convenience of staying within PostgreSQL outweighs the performance difference. Operational simplicity is an underrated virtue.
Choosing a Vector Store
| Use Case | Recommended |
|---|---|
| Prototyping, small datasets | FAISS or ChromaDB |
| Medium-scale, need metadata filtering | ChromaDB or Qdrant |
| Large-scale production | Qdrant or pgvector |
| Already using PostgreSQL | pgvector |
| Need horizontal scaling | Qdrant |
Retrieval Strategies
Getting the right documents from the vector store is the single most important step in the RAG pipeline. A model that receives relevant context will generate good answers. A model that receives irrelevant context will generate confidently wrong answers. Garbage in, garbage out — but with better punctuation.
Top-K Similarity Search
The simplest retrieval strategy: embed the query, find the k most similar document chunks, return them. This is what most introductory RAG tutorials use, and it works surprisingly well for straightforward queries.
The choice of k matters. Too small, and you miss relevant context. Too large, and you flood the model with noise, consuming context window tokens on irrelevant text that the model then has to ignore (or worse, gets confused by). k=3 to k=5 is a reasonable starting point for most applications. Tune based on your specific use case.
Maximum Marginal Relevance (MMR)
Top-k retrieval has a diversity problem. If your knowledge base contains five slightly different paragraphs that all say roughly the same thing, top-k will retrieve all five, wasting your context window on redundant information. You get five ways of saying the same thing and zero ways of saying anything else.
MMR addresses this by balancing relevance and diversity. It selects documents that are both similar to the query and dissimilar to each other:
results = vectorstore.max_marginal_relevance_search(
    query="How do I configure the database?",
    k=5,
    fetch_k=20,      # fetch 20 candidates, select 5 diverse ones
    lambda_mult=0.7  # 0 = max diversity, 1 = max relevance
)
The lambda_mult parameter controls the tradeoff. A value of 0.7 leans toward relevance while still penalizing redundancy. In practice, MMR almost always outperforms naive top-k for knowledge base queries.
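The greedy selection underneath MMR is simple enough to write out. A from-scratch sketch on plain 2D vectors (real implementations run the same loop over the fetched candidates' embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two 2D vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr_select(query, docs, k, lambda_mult):
    # Greedy MMR: each step picks the candidate maximising
    # lambda * relevance_to_query - (1 - lambda) * similarity_to_selected.
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is distinct but still relevant.
docs = [(0.95, 0.31), (0.95, 0.32), (0.8, -0.6)]
picked = mmr_select(query=(1.0, 0.0), docs=docs, k=2, lambda_mult=0.7)
```

Plain top-2 would return the two near-duplicates; MMR returns one of them plus the distinct document.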
Hybrid Search
Pure semantic search has a blind spot: it can miss exact keyword matches. If a user searches for "error code E-4072" and your knowledge base has a document titled "Troubleshooting Error E-4072," semantic search might rank it lower than a document about "common database errors" that is semantically closer to the query in embedding space but does not mention the specific error code.
Hybrid search combines semantic search (vector similarity) with keyword search (BM25 or similar) and fuses the results:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword-based retriever
bm25_retriever = BM25Retriever.from_texts(chunks)
bm25_retriever.k = 5
# Vector-based retriever
vector_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)
# Combine with reciprocal rank fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
results = ensemble_retriever.invoke(
    "How do I fix error code E-4072?"
)
Hybrid search is the retrieval strategy for production systems. It handles both semantic queries ("how do I set up the database?") and keyword queries ("error E-4072") gracefully. The weighting between keyword and semantic components is tunable and should be adjusted based on your query distribution.
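The fusion step itself is worth understanding. An unweighted sketch of reciprocal rank fusion, the technique used to merge the two ranked lists (the document names here are hypothetical):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each document scores sum over lists of 1 / (k + rank). The constant
    # k (60 in the original RRF paper) damps the advantage of rank 1.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["e4072_guide", "db_errors", "install"],   # keyword (BM25) ranking
    ["e4072_guide", "network", "db_errors"],   # vector ranking
])
```

A document that appears high in both lists beats one that tops a single list, which is why hybrid search is robust to either retriever having a blind spot. Adding per-retriever weights (as `EnsembleRetriever` does) is a one-line change: multiply each list's contribution by its weight.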
Reranking
Retrieval from a vector store is fast but approximate. The embedding similarity between a query and a document is a rough proxy for relevance, but it misses nuances that a more sophisticated model can capture. Reranking adds a second pass: take the top-N candidates from retrieval and rerank them using a cross-encoder model that considers the query and each document jointly.
from sentence_transformers import CrossEncoder
# Retrieve candidates
candidates = vectorstore.similarity_search(query, k=20)
# Rerank with a cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)
# Sort by reranking score
reranked = [doc for _, doc in sorted(
    zip(scores, candidates),
    key=lambda x: x[0],
    reverse=True
)][:5]
Reranking is computationally expensive compared to vector search, which is why you apply it to a small candidate set (typically 20-50 documents) rather than the entire corpus. The improvement in relevance is often dramatic — studies consistently show 10-30% improvements in retrieval quality with reranking.
The End-to-End RAG Pipeline
Here is a complete, minimal RAG pipeline in Python:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# 1. Index documents (assumes `documents` is a list of already-loaded
#    LangChain Document objects)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 2. Define the prompt
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following context. If the context
does not contain enough information to answer, say so explicitly.
Do not make up information.

Context:
{context}

Question: {question}

Answer:
""")
# 3. Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)

def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# 4. Query
answer = chain.invoke("How do I configure the database connection?")
print(answer)
This is roughly fifty lines of code. It is also roughly 80% of what most production RAG systems do, with the remaining 20% being error handling, monitoring, caching, authentication, and the other unglamorous necessities of production software.
Common Failure Modes and Debugging
RAG systems fail in predictable ways. Understanding these failure modes is half the battle.
Retrieval Returns Irrelevant Documents
Symptoms: The model produces answers that are technically well-formed but clearly based on wrong context. It answers a question about database configuration by citing the email server documentation.
Diagnosis: Examine the retrieved documents. Are they relevant to the query? If not, the problem is in retrieval, not generation.
Common causes:
- Chunk size too large (chunks contain a mix of relevant and irrelevant information, and the irrelevant part dominates the embedding).
- Poor embedding model choice for your domain.
- Missing metadata filtering (retrieving from the wrong document collection).
- Query is ambiguous and the embedding cannot disambiguate.
Fixes: Try smaller chunks, add metadata filtering, use hybrid search, or rephrase queries before embedding.
Retrieval Returns Relevant Documents But Model Ignores Them
Symptoms: The retrieved documents contain the answer, but the model generates something different — often drawing on its parametric knowledge instead of the provided context.
Diagnosis: Check the prompt. Is the instruction to use the provided context clear enough? Is the context positioned effectively in the prompt?
Common causes:
- Weak prompting. The model needs explicit instruction to prioritize the provided context over its training data.
- Context too long. When the prompt contains many chunks, the model may lose track of the relevant information (the "lost in the middle" problem, where models pay less attention to content in the middle of long contexts).
- Model temperature too high, encouraging creative generation rather than faithful extraction.
Fixes: Strengthen the system prompt, reduce the number of retrieved chunks, place the most relevant chunks at the beginning and end of the context, set temperature to 0 or near 0.
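The reordering fix is mechanical enough to sketch. Given documents sorted most-relevant-first, interleave them so the strongest sit at the edges of the context and the weakest sink to the middle:

```python
def reorder_for_long_context(docs_most_relevant_first):
    # Counter the "lost in the middle" effect: alternate documents onto
    # the front and back of the context, so ranks 1 and 2 end up at the
    # two edges and the lowest-ranked documents land in the middle.
    front, back = [], []
    for i, doc in enumerate(docs_most_relevant_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

order = reorder_for_long_context(["doc1", "doc2", "doc3", "doc4", "doc5"])
```

LangChain ships a comparable `LongContextReorder` document transformer implementing the same idea.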
The System Hallucinates Despite RAG
Symptoms: The model generates claims that are not in the retrieved context and are not true.
Diagnosis: This happens when the retrieved context is insufficient to answer the query, but the model generates an answer anyway rather than admitting ignorance.
Fixes: Instruct the model explicitly to say "I don't know" or "the provided documents don't contain this information" when the context is insufficient. Implement post-generation checking that verifies claims against the source documents. Consider adding a confidence score.
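A post-generation check does not need to be sophisticated to catch the worst offenders. A crude word-overlap sketch (production systems use an NLI model or an LLM judge instead, and the threshold here is an arbitrary assumption):

```python
import re

def unsupported_sentences(answer, context, threshold=0.5):
    # Flag answer sentences whose word overlap with the retrieved context
    # is low -- a rough signal that the claim was not grounded in it.
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"[.!?]", answer):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if words and len(words & ctx_words) / len(words) < threshold:
            flagged.append(sentence.strip())
    return flagged

context = "Refunds are issued within 30 days of purchase."
answer = "Refunds are issued within 30 days of purchase. Shipping is free worldwide."
flagged = unsupported_sentences(answer, context)
```

The second sentence shares almost no vocabulary with the context and gets flagged for review; the first passes.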
Chunking Splits Critical Information
Symptoms: The answer to a question requires information from multiple parts of a document, but chunking has split it across separate chunks that are not all retrieved.
Diagnosis: Look at the original document and the chunks. Is the relevant information split across chunk boundaries?
Fixes: Increase chunk overlap, use parent document retrieval (retrieve the chunk but pass the parent document to the model), or use document-aware chunking that respects logical boundaries.
Performance Degrades as Knowledge Base Grows
Symptoms: The system worked well with 100 documents but quality drops at 10,000 documents. Retrieval returns marginally relevant documents that crowd out the truly relevant ones.
Diagnosis: As the corpus grows, the semantic neighborhood of any query becomes more crowded. Chunks that are vaguely similar to the query proliferate.
Fixes: Add metadata filtering to narrow the search space. Use hybrid search. Implement reranking. Consider hierarchical retrieval (first identify the relevant document, then search within it).
Advanced Patterns
A few patterns worth knowing, even if you do not implement them immediately:
Query transformation. Before embedding the user's query, transform it to improve retrieval. This might mean generating a hypothetical answer (HyDE — Hypothetical Document Embeddings) and using that as the search query, or breaking a complex query into sub-queries that are each retrieved independently.
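The HyDE flow is a small wrapper around the components you already have. A sketch with the model, embedder, and vector search passed in as callables (the prompt wording is an illustrative assumption):

```python
def hyde_retrieve(question, generate, embed, search, k=5):
    # HyDE: instead of embedding the question, generate a *hypothetical
    # answer* and embed that -- answers tend to sit closer to real
    # documents in embedding space than questions do.
    hypothetical = generate(
        f"Write a short passage that answers this question: {question}"
    )
    return search(embed(hypothetical), k=k)

# Stub components standing in for an LLM, an embedding model, and a store:
trace = {}
def generate(prompt):
    return "Set the host field in db.yaml."
def embed(text):
    trace["embedded"] = text
    return [0.1, 0.2]
def search(vector, k):
    trace["k"] = k
    return ["chunk-1"]

hits = hyde_retrieve("How do I configure the database?", generate, embed, search, k=3)
```

Note what gets embedded: the generated passage, never the raw question.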
Parent document retrieval. Index small chunks for precise retrieval, but return the larger parent document (or section) for context. This gives you the best of both worlds: precise retrieval with rich context.
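The bookkeeping behind parent document retrieval is just a chunk-to-parent map. A toy sketch with a naive sentence chunker:

```python
def build_index(parents, chunk):
    # Index small chunks for precise matching, but remember each chunk's parent.
    chunk_to_parent = {}
    for parent_id, text in parents.items():
        for piece in chunk(text):
            chunk_to_parent[piece] = parent_id
    return chunk_to_parent

def retrieve(matched_chunk, chunk_to_parent, parents):
    # Retrieval matched a small chunk; generation receives the whole parent.
    return parents[chunk_to_parent[matched_chunk]]

parents = {
    "install-guide": "Step 1: download. Step 2: run the installer. Step 3: reboot.",
}
index = build_index(parents, chunk=lambda text: text.split(". "))
context = retrieve("Step 2: run the installer", index, parents)
```

LangChain's `ParentDocumentRetriever` packages this pattern, storing parents in a docstore keyed by the id each child chunk carries.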
Self-querying. Use a language model to extract structured filters from a natural language query. "What were the Q3 2025 revenue numbers for the enterprise segment?" becomes a vector search for revenue information filtered by date=Q3-2025 and segment=enterprise.
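The extraction step can be sketched with regexes for the example above. This is a toy stand-in: a real system asks an LLM to emit the structured filters (LangChain's `SelfQueryRetriever` does exactly that), and the segment vocabulary here is a made-up assumption:

```python
import re

def extract_filters(query):
    # Pull structured metadata filters out of a natural language query.
    filters = {}
    quarter = re.search(r"\bQ([1-4])\s+(20\d{2})\b", query)
    if quarter:
        filters["date"] = f"Q{quarter.group(1)}-{quarter.group(2)}"
    segment = re.search(r"\b(enterprise|consumer|smb)\b", query, re.IGNORECASE)
    if segment:
        filters["segment"] = segment.group(1).lower()
    return filters

filters = extract_filters(
    "What were the Q3 2025 revenue numbers for the enterprise segment?"
)
```

The resulting dict feeds straight into the vector store's metadata filter (the `where` clause in the Chroma example, or a `Filter` in Qdrant), shrinking the search space before any similarity computation runs.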
Agentic RAG. Instead of a single retrieval-generation cycle, use an agent that can iteratively search, evaluate results, refine queries, and search again until it has sufficient context to answer the question. This is more complex and more expensive, but dramatically more capable for complex queries.
RAG is not a silver bullet. It does not solve the fundamental problem of language model reliability. But it converts AI from a parlor trick that generates plausible-sounding fiction into a genuinely useful tool that generates answers grounded in your actual knowledge base. That is a transformation worth understanding in detail.