RAG (Retrieval-Augmented Generation)¶
RAG enhances LLM responses by retrieving relevant information and including it in the prompt, allowing LLMs to answer questions about data they weren't trained on.
Why RAG?¶
RAG addresses common LLM limitations (knowledge cutoffs, no access to private data, hallucination, and limited context windows) by supplying current information, grounding responses in actual data, and focusing the context window on relevant material.
Architecture¶
A RAG pipeline flows through five stages: Query → Retrieval → Context Assembly → LLM → Response. Retrieved documents typically come from a vector database containing embedded versions of the source documents.
Core Components¶
Document Processing¶
Source documents are transformed through parsing (extract text), chunking (split into pieces), embedding (convert to vectors), and indexing (store for retrieval).
Chunking Strategies¶
| Strategy | Characteristics | Trade-offs |
|---|---|---|
| Fixed-size | Split every N tokens | Simple and predictable, may split mid-concept |
| Semantic | Split at natural boundaries | Preserves meaning, variable sizes |
| Overlapping | Chunks overlap by percentage | Helps with context at boundaries, increases storage |
| Hierarchical | Multiple levels (doc → section → paragraph) | Different granularity retrieval, more complex |
Chunk Size Trade-off
Smaller chunks enable precise retrieval but may miss context. Larger chunks provide more context but may include irrelevant information. Start with 200-500 tokens and adjust based on results.
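As a rough illustration of the fixed-size and overlapping strategies, the sketch below splits a token list into chunks of `chunk_size` tokens that share `overlap` tokens with their neighbor. The function name `chunk_tokens` and its defaults are assumptions made for this example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Whitespace splitting is a placeholder for a real tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is the storage cost the table mentions. Tuning `chunk_size` and `overlap` against retrieval results is usually more informative than picking values up front.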
Embeddings¶
Embeddings convert text to high-dimensional vectors for similarity search. Use the same embedding model for both indexing and querying, match the model to your content type, and consider multilingual needs if applicable.
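To make the idea of similarity search concrete, here is a toy sketch. `toy_embed` is a hypothetical stand-in that hashes words into buckets; it is not a real embedding model and captures no semantics, but it shows why the same function must be used for both indexing and querying: vectors are only comparable when they come from the same mapping.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real models typically use hundreds or thousands


def toy_embed(text, dim=DIM):
    """Hash each word into one of `dim` buckets and L2-normalize the counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query_vec = toy_embed("reset my password")
same = cosine_similarity(query_vec, toy_embed("reset my password"))
print(round(same, 6))
```

In a real system, `toy_embed` would be replaced by a call to whichever embedding model you choose, while the similarity computation stays the same.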
Vector Storage¶
Vector databases store embeddings and support similarity search (nearest neighbors) and filtered search (similarity + metadata filters). Options range from simple in-memory solutions for development to dedicated vector databases for production.
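The following is a minimal sketch of the "simple in-memory" end of that range, supporting both similarity search and filtered search. The class `InMemoryVectorStore`, its `where` parameter, and the `Record` shape are invented for illustration and do not mirror any particular vector database's API.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    doc_id: str
    vector: list
    metadata: dict = field(default_factory=dict)


class InMemoryVectorStore:
    """Brute-force store: fine for development, not for large collections."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, vector, metadata=None):
        self.records.append(Record(doc_id, vector, metadata or {}))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query_vector, k=3, where=None):
        """Return the top-k (doc_id, score) pairs, optionally filtered on metadata."""
        where = where or {}
        # Apply exact-match metadata filters before scoring.
        candidates = [
            r for r in self.records
            if all(r.metadata.get(key) == value for key, value in where.items())
        ]
        scored = [(r.doc_id, self._cosine(query_vector, r.vector)) for r in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add("a", [1.0, 0.0], {"source": "wiki"})
store.add("b", [0.9, 0.1], {"source": "handbook"})
store.add("c", [0.0, 1.0], {"source": "wiki"})

top = store.search([1.0, 0.0], k=2)
wiki_only = store.search([1.0, 0.0], k=2, where={"source": "wiki"})
print(top)
print(wiki_only)
```

This version scans every record on each query. Dedicated vector databases generally use approximate nearest-neighbor indexes to avoid that cost at scale.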
Retrieval Strategies¶
| Strategy | Description | When to Use |
|---|---|---|
| Simple similarity | Return top-k most similar | Basic use cases |
| Filtered | Add metadata filters (date, source, etc.) | When you need scoping |
| Hybrid | Combine semantic + keyword search | When recall matters and semantic search alone misses relevant documents |
| Reranking | Retrieve more, then rerank with another model | When precision of the top results matters most |
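One way to combine semantic and keyword results in a hybrid setup is reciprocal rank fusion (RRF). The source does not prescribe a fusion method, so treat this as one possible approach: the sketch merges ranked lists of document ids using only their positions, which avoids having to put similarity scores and keyword scores on a common scale. The document ids are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from vector similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one still appear, which is where the recall gain comes from.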
Context Assembly¶
Combine retrieved documents into a prompt that instructs the LLM to answer based only on provided context and to acknowledge when information is insufficient.
```python
def build_prompt(query, retrieved_docs):
    # Join the retrieved chunks, separated by blank lines.
    # Assumes each doc exposes its text as a `.content` attribute.
    context = "\n\n".join(doc.content for doc in retrieved_docs)
    return f"""Answer based on the following context:
Context:
{context}
Question: {query}
Answer based only on the context provided. If the context doesn't contain
the answer, say you don't have enough information."""
```
Common Patterns¶
| Pattern | Description | Use Case |
|---|---|---|
| Basic RAG | Query → Retrieve → Generate | Straightforward Q&A |
| Query Enhancement | Expand/rewrite query before retrieval | Improve retrieval quality |
| Iterative RAG | Multiple retrieval rounds based on need | Complex questions requiring multiple sources |
| Agentic RAG | Agent decides retrieval strategy | Dynamic scenarios with multiple sources |
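The first two patterns can be sketched as a single function that takes its stages as arguments. Everything here is a placeholder chosen for illustration: `keyword_retrieve` stands in for a real retriever, `expand_abbreviations` for a query rewriter, and `echo_llm` for an LLM call (it simply returns the prompt so the flow can be inspected).

```python
def basic_rag(query, retrieve, generate, rewrite=None, k=3):
    """Query -> (optional rewrite) -> retrieve -> generate."""
    search_query = rewrite(query) if rewrite else query
    docs = retrieve(search_query, k)
    context = "\n\n".join(docs)
    # The original question, not the rewritten one, goes into the prompt.
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)


CORPUS = [
    "Refunds are processed within 5 business days.",
    "Passwords can be reset from the account settings page.",
]


def keyword_retrieve(query, k):
    """Stand-in retriever: rank documents by shared lowercase words."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def expand_abbreviations(query):
    """Stand-in query rewriter: expand one known abbreviation."""
    return " ".join("passwords" if word == "pw" else word for word in query.split())


def echo_llm(prompt):
    """Stand-in for an LLM call: returns the prompt unchanged."""
    return prompt


answer = basic_rag("how do I reset pw", keyword_retrieve, echo_llm,
                   rewrite=expand_abbreviations, k=1)
print(answer)
```

Passing the stages in as functions keeps the basic flow fixed while letting a rewriter, a different retriever, or a loop for iterative retrieval be added later without changing the caller.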
Quality Factors¶
Retrieval Quality¶
Retrieval quality is the most critical factor in a RAG system: bad retrieval produces bad responses regardless of LLM quality. Measure whether relevant documents are retrieved, whether irrelevant ones are filtered out, and whether the ranking is correct. Improve it through better chunking strategies, query enhancement, hybrid search, or reranking.
Context Utilization¶
Ensure the LLM uses retrieved context effectively through clear instructions, source attribution requirements, and structured context presentation. Watch for problems like ignoring relevant context or mixing up information from multiple sources.
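A small sketch of structured context presentation: each retrieved passage gets a number and a source label so the model can cite it and is less likely to blend passages together. The dictionary keys `source` and `text`, the file names, and the instruction wording are assumptions for this example.

```python
def format_context(docs):
    """Number each retrieved document and label it with its source."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        blocks.append(f"[{i}] (source: {doc['source']})\n{doc['text']}")
    return "\n\n".join(blocks)


docs = [
    {"source": "handbook.md", "text": "Vacation requests need manager approval."},
    {"source": "faq.md", "text": "Requests are submitted through the HR portal."},
]
context = format_context(docs)
instructions = "Cite the bracketed number of every passage you rely on, e.g. [1]."
print(context)
print(instructions)
```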
Hallucination Prevention¶
RAG reduces but doesn't eliminate hallucination. Strategies include instructing the model to use only the provided context, requiring source citations, verifying claims against the retrieved context, and using lower temperature settings.
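As one inexpensive verification step, the sketch below checks that every citation in an answer points at a passage that was actually provided. It assumes the prompt asked the model to cite passages as bracketed numbers such as `[1]`; the function name and the sample answer are invented for illustration. Note that this only catches citations to nonexistent sources and does not confirm that a cited passage supports the claim.

```python
import re


def check_citations(answer, num_sources):
    """Return (cited, invalid): citation numbers found, and those out of range."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    return cited, invalid


answer = "Requests need manager approval [1] and go through the portal [3]."
cited, invalid = check_citations(answer, num_sources=2)
print(cited, invalid)
```

An answer with no citations at all, or with invalid ones, can be flagged for regeneration or review.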
Implementation Considerations¶
Indexing Pipeline¶
Consider how often source data changes, whether to use incremental or full reindex, how to parallelize processing, and how to handle malformed documents.
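For the incremental-versus-full question, one common approach is to store a content hash per document and re-embed only what changed. The sketch below assumes documents are available as an id-to-text mapping and that hashes from the last indexing run were saved; `plan_reindex` and the sample data are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents against the hashes stored at last index.

    current_docs: {doc_id: text}; indexed_hashes: {doc_id: hash}.
    Returns (to_upsert, to_delete) as sorted lists of doc ids.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))  # removed at source
    return to_upsert, to_delete


indexed = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
to_upsert, to_delete = plan_reindex(current, indexed)
print(to_upsert, to_delete)
```

A full reindex is still needed when the chunking strategy or embedding model changes, since every stored vector is then stale regardless of content.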
Query Pipeline¶
Balance latency budget against quality, identify caching opportunities (embeddings, frequent queries), define fallback strategies for retrieval failures, and manage costs from embedding and LLM calls.
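Two of these ideas, caching embeddings and falling back when retrieval fails, can be sketched with the standard library alone. The embedding function here is a placeholder rather than a real model, and `failing_vector_search` and `keyword_search` are hypothetical retrievers used to exercise the fallback path.

```python
from functools import lru_cache

CALLS = {"embed": 0}  # counts how often the placeholder "model" is actually invoked


@lru_cache(maxsize=1024)
def cached_embed(text):
    """Placeholder embedding; repeated inputs are served from the cache."""
    CALLS["embed"] += 1
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple(float(ord(c)) for c in text[:4])


def retrieve_with_fallback(query, primary, fallback):
    """Try the primary retriever; on error or empty result, use the fallback."""
    try:
        results = primary(query)
    except Exception:
        results = []
    return results or fallback(query)


def failing_vector_search(query):
    raise TimeoutError("vector store unavailable")


def keyword_search(query):
    return ["keyword hit for: " + query]


cached_embed("hello")
cached_embed("hello")  # second call hits the cache
docs = retrieve_with_fallback("refund policy", failing_vector_search, keyword_search)
print(CALLS["embed"], docs)
```

In production you would likely log the swallowed exception and use a shared cache rather than a per-process one, but the control flow is the same.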
Maintenance¶
RAG systems require ongoing work: reindex when sources change, monitor retrieval quality metrics, update embeddings when better models become available, and handle growing data volumes.
Evaluation¶
| Metric | What It Measures |
|---|---|
| Retrieval precision | Are retrieved docs relevant? |
| Retrieval recall | Are all relevant docs retrieved? |
| Answer relevance | Does the answer address the question? |
| Faithfulness | Is the answer supported by context? |
| Answer completeness | Does it cover all aspects? |
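The two retrieval metrics can be computed directly once you have, for each test query, the ids the retriever returned and a hand-labeled set of relevant ids. The sketch below handles a single query; the ids are made up, and the function divides precision by the number of results actually returned, which differs from dividing by `k` when fewer than `k` results come back.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved ids for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d4", "d2", "d9"]  # ranked retriever output
relevant = {"d1", "d2", "d3"}         # hand-labeled ground truth
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(precision, recall)
```

Averaging these values over a set of representative queries gives a baseline to compare against when changing chunking, embeddings, or retrieval strategy. The answer-level metrics in the table (relevance, faithfulness, completeness) usually need human or model-based judgment rather than a set comparison.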
Key Takeaways¶
RAG grounds LLM responses in actual data through a retrieval pipeline. Chunking strategy significantly impacts quality, and retrieval quality is the foundation of a successful RAG system. Start simple with basic retrieval and add complexity based on evaluation results. Monitor and maintain the system over time as data and requirements evolve.