RAG (Retrieval-Augmented Generation)

RAG enhances LLM responses by retrieving relevant information and including it in the prompt, allowing LLMs to answer questions about data they weren't trained on.

Why RAG?

RAG addresses LLM limitations (knowledge cutoff, no private data access, hallucination, limited context) by providing current information, grounding responses in actual data, and focusing context on relevant information.

Architecture

The RAG pipeline flows: Query → Retrieval → Context Assembly → LLM → Response. Retrieved documents come from a vector database containing embedded versions of source documents.

Core Components

Document Processing

Source documents are transformed through parsing (extract text), chunking (split into pieces), embedding (convert to vectors), and indexing (store for retrieval).

Chunking Strategies

| Strategy | Characteristics | Trade-offs |
| --- | --- | --- |
| Fixed-size | Split every N tokens | Simple and predictable; may split mid-concept |
| Semantic | Split at natural boundaries | Preserves meaning; variable sizes |
| Overlapping | Chunks overlap by a percentage | Helps with context at boundaries; increases storage |
| Hierarchical | Multiple levels (doc → section → paragraph) | Retrieval at different granularities; more complex |

Chunk Size Trade-off

Smaller chunks enable precise retrieval but may miss context. Larger chunks provide more context but may include irrelevant information. Start with 200-500 tokens and adjust based on results.
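
The trade-off above can be sketched with a minimal fixed-size chunker with overlap. This is an illustration only: it approximates tokens with whitespace-separated words, whereas a production system would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks of `chunk_size` words,
    sharing `overlap` words between consecutive chunks so that
    concepts cut at a boundary still appear whole in one chunk."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already reaches the end of the text
    return chunks
```

Increasing `overlap` trades storage for better boundary context, which is the same trade-off noted for the overlapping strategy above.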

Embeddings

Embeddings convert text to high-dimensional vectors for similarity search. Use the same embedding model for both indexing and querying, match the model to your content type, and consider multilingual needs if applicable.
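
The similarity computation underlying embedding search is typically cosine similarity. A minimal version (shown here on toy vectors; real embeddings have hundreds or thousands of dimensions) looks like this:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: the dot product of the two vectors
    divided by the product of their lengths (range -1 to 1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the score depends only on direction, not magnitude, documents and queries embedded by the same model are comparable regardless of text length.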

Vector Storage

Vector databases store embeddings and support similarity search (nearest neighbors) and filtered search (similarity + metadata filters). Options range from simple in-memory solutions for development to dedicated vector databases for production.
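
An in-memory store of the kind suitable for development can be sketched as follows. This is illustrative, not a production database: it brute-forces cosine similarity with NumPy and supports filtered search through a metadata predicate.

```python
import numpy as np

class InMemoryVectorStore:
    """Minimal development-time vector store: brute-force cosine
    search over normalized vectors, with optional metadata filtering."""

    def __init__(self):
        self.vectors, self.docs = [], []

    def add(self, vector, content, metadata=None):
        v = np.asarray(vector, float)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at index time
        self.docs.append({"content": content, "metadata": metadata or {}})

    def search(self, query_vector, k=3, filter_fn=None):
        q = np.asarray(query_vector, float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q  # cosine = dot of unit vectors
        results = []
        for i in np.argsort(-sims):  # most similar first
            doc = self.docs[i]
            if filter_fn and not filter_fn(doc["metadata"]):
                continue  # filtered search: similarity + metadata predicate
            results.append((doc, float(sims[i])))
            if len(results) == k:
                break
        return results
```

Dedicated vector databases replace the brute-force scan with approximate nearest-neighbor indexes, but expose essentially this interface.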

Retrieval Strategies

| Strategy | Description | When to Use |
| --- | --- | --- |
| Simple similarity | Return top-k most similar | Basic use cases |
| Filtered | Add metadata filters (date, source, etc.) | When you need scoping |
| Hybrid | Combine semantic + keyword search | Improves recall |
| Reranking | Retrieve more, then rerank with another model | Maximize precision |
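
Hybrid search can be sketched by blending a semantic score with a keyword-overlap score. The weighting and the simple term-overlap measure here are illustrative assumptions; real systems commonly use BM25 for the keyword side and tune the blend on evaluation data.

```python
def hybrid_score(semantic_sim, query, doc_text, alpha=0.7):
    """Blend semantic similarity with keyword overlap.
    `alpha` weights the semantic component; 1 - alpha weights the
    fraction of query terms that literally appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc_text.lower().split())
    keyword = len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
    return alpha * semantic_sim + (1 - alpha) * keyword
```

The keyword component rescues exact matches (IDs, error codes, rare names) that embedding similarity alone can miss, which is why hybrid search improves recall.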

Context Assembly

Combine retrieved documents into a prompt that instructs the LLM to answer based only on provided context and to acknowledge when information is insufficient.

def build_prompt(query, retrieved_docs):
    context = "\n\n".join([doc.content for doc in retrieved_docs])

    return f"""Answer based on the following context:

Context:
{context}

Question: {query}

Answer based only on the context provided. If the context doesn't contain
the answer, say you don't have enough information."""

Common Patterns

| Pattern | Description | Use Case |
| --- | --- | --- |
| Basic RAG | Query → Retrieve → Generate | Straightforward Q&A |
| Query Enhancement | Expand/rewrite query before retrieval | Improve retrieval quality |
| Iterative RAG | Multiple retrieval rounds based on need | Complex questions requiring multiple sources |
| Agentic RAG | Agent decides retrieval strategy | Dynamic scenarios with multiple sources |
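
The iterative pattern can be sketched as a loop in which the model either answers or requests another retrieval round. The `retrieve` and `answer_or_followup` callables below are hypothetical placeholders for your retriever and LLM call, not a prescribed API.

```python
def iterative_rag(query, retrieve, answer_or_followup, max_rounds=3):
    """Iterative RAG sketch. `retrieve(query)` returns a list of docs;
    `answer_or_followup(query, docs)` returns ("answer", text) when the
    gathered context suffices, or ("followup", new_query) to trigger
    another retrieval round. Returns (answer_or_None, all_docs)."""
    docs = []
    current = query
    for _ in range(max_rounds):
        docs += retrieve(current)
        kind, value = answer_or_followup(query, docs)
        if kind == "answer":
            return value, docs
        current = value  # the model asked for more information
    return None, docs  # rounds exhausted without a confident answer
```

Capping the number of rounds bounds both latency and cost, which matters because each round adds a retrieval and an LLM call.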

Quality Factors

Retrieval Quality

The most critical factor in RAG systems. Bad retrieval produces bad responses regardless of LLM quality. Measure whether relevant documents are retrieved, irrelevant ones filtered out, and ranking is correct. Improve through better chunking strategies, query enhancement, hybrid search, or reranking.

Context Utilization

Ensure the LLM uses retrieved context effectively through clear instructions, source attribution requirements, and structured context presentation. Watch for problems like ignoring relevant context or mixing up information from multiple sources.

Hallucination Prevention

RAG reduces but doesn't eliminate hallucination. Strategies include instructing to use only provided context, requiring source citations, verifying claims against context, and using lower temperature settings.

Implementation Considerations

Indexing Pipeline

Consider how often source data changes, whether to use incremental or full reindex, how to parallelize processing, and how to handle malformed documents.
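
The incremental-versus-full decision can be sketched by comparing content hashes against the state recorded at the last indexing run. This is a minimal illustration: `sources` and `index_state` are assumed in-memory mappings standing in for your document source and index metadata.

```python
import hashlib

def plan_incremental_reindex(sources, index_state):
    """Decide what to (re)index by comparing content hashes.
    `sources` maps doc_id -> current text; `index_state` maps
    doc_id -> hash recorded at the last indexing run.
    Returns (ids to index, ids to delete from the index)."""
    to_index = []
    for doc_id, text in sources.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            to_index.append(doc_id)  # new or changed document
    to_delete = [d for d in index_state if d not in sources]  # removed sources
    return to_index, to_delete
```

Only the changed documents pay the parsing, chunking, and embedding cost, which keeps reindexing tractable as sources grow.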

Query Pipeline

Balance latency budget against quality, identify caching opportunities (embeddings, frequent queries), define fallback strategies for retrieval failures, and manage costs from embedding and LLM calls.
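
Embedding caching, one of the opportunities mentioned above, can be as simple as memoizing the embedding call. The embedding function here is a placeholder standing in for a real API call; the call counter exists only to make the caching visible.

```python
from functools import lru_cache

calls = {"count": 0}  # illustration only: counts cache misses

def _embed_uncached(text):
    # Placeholder for an expensive embedding API call.
    return [float(len(text))]

@lru_cache(maxsize=10_000)
def embed(text):
    """Memoized embedding: repeated and frequent queries skip the
    API cost entirely. Returns a tuple so the result is immutable."""
    calls["count"] += 1
    return tuple(_embed_uncached(text))
```

A process-local `lru_cache` suits a single server; shared deployments would move the same idea into an external cache keyed by a hash of the text and the embedding model name.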

Maintenance

RAG systems require ongoing work: reindex when sources change, monitor retrieval quality metrics, update embeddings when better models become available, and handle growing data volumes.

Evaluation

| Metric | What It Measures |
| --- | --- |
| Retrieval precision | Are retrieved docs relevant? |
| Retrieval recall | Are all relevant docs retrieved? |
| Answer relevance | Does the answer address the question? |
| Faithfulness | Is the answer supported by context? |
| Answer completeness | Does it cover all aspects? |
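
The two retrieval metrics can be computed directly from the sets of retrieved and relevant document IDs. This is a minimal sketch; the relevance judgments themselves must come from a labeled evaluation set you maintain.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Tracking both matters: raising top-k improves recall but usually lowers precision, and the table's answer-level metrics degrade when either one slips.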

Key Takeaways

RAG grounds LLM responses in actual data through a retrieval pipeline. Chunking strategy significantly impacts quality, and retrieval quality is the foundation of a successful RAG system. Start simple with basic retrieval and add complexity based on evaluation results. Monitor and maintain the system over time as data and requirements evolve.