Building the Perfect RAG
Every RAG prototype works. The embedding model is fast, the top-5 results look relevant, and the LLM strings together a coherent answer. Then you ship it, and reality arrives: answers that ignore the retrieved context, retrievals that return the wrong chunk from the right document, and a context window that silently overflows and drops evidence.
RAG is not a solved problem you plug in — it is a pipeline you engineer. This post is a practical guide to building one that actually holds up in production.
Table of Contents
- What RAG is (and isn't)
- The anatomy of a RAG pipeline
- Common pitfalls and how to avoid them
- Advanced techniques that move the needle
- Evaluation strategies
- Architecture decision guide
What RAG Is (and Isn't)
Retrieval-Augmented Generation is a pattern: retrieve relevant documents from an external store, inject them into the LLM's context, then generate a grounded response. That's it.
What it is not: a magic accuracy layer you bolt onto any model. RAG amplifies the quality of your retrieval. Garbage in, garbage out — just faster and with a citation.
The value proposition is real though. As covered in Why RAG beats fine-tuning for most use cases, RAG gives you up-to-date, maintainable knowledge without retraining — which is why it has become the default architecture for production knowledge bases, support bots, and document Q&A systems.
The Anatomy of a RAG Pipeline
A production RAG pipeline has five moving parts. Each one has failure modes.
1. Chunking
Chunking converts raw documents into indexable units. Get this wrong and every downstream step suffers — retrieval returns the wrong neighborhood, the LLM sees truncated context, and semantic coherence collapses.
Naive fixed-size chunking is a trap. Splitting every 512 tokens without regard for structure severs sentences, splits tables, and buries headings in the wrong chunk.
Better strategies:
- Recursive character splitting — respects paragraph, sentence, and word boundaries in priority order.
- Semantic chunking — uses embedding similarity to detect topic shifts and split there instead of at an arbitrary token count.
- Hierarchical (parent-child) chunking — index small chunks for precision retrieval, but return the surrounding parent chunk as context to the LLM (a minimal sketch follows the splitter example below).
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,  # ~25% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)
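If you go the hierarchical (parent-child) route from the list above, a minimal sketch looks like this. It reuses the splitter import above and assumes the same `docs` list of LangChain documents; the two chunk sizes and the record shape are illustrative placeholders, not recommendations.

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)

parent_store = {}   # parent_id -> full parent text, handed to the LLM at answer time
child_records = []  # small child chunks, the ones that actually get embedded and indexed

for doc in docs:
    for p_id, parent in enumerate(parent_splitter.split_text(doc.page_content)):
        parent_store[(doc.metadata["source"], p_id)] = parent
        for child in child_splitter.split_text(parent):
            child_records.append({
                "text": child,                                # precise retrieval unit
                "parent_id": (doc.metadata["source"], p_id),  # pointer back to the wider context
            })

At query time you search over the child chunks, deduplicate on parent_id, and pass the parent text to the model.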
Pro tip: Add document-level metadata to every chunk at index time — source URL, section title, creation date. You will need it for filtering and attribution later.
2. Embedding
Each chunk gets converted into a dense vector that encodes its semantic meaning. Similarity search happens in this vector space.
Model choices matter. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source models like bge-large-en-v1.5 or nomic-embed-text all have different strengths. The key variables:
- Dimensionality — higher dimensions generally capture more nuance, but increase storage and query latency.
- Domain alignment — a general-purpose embedding model may underperform on legal, medical, or code corpora. Evaluate on your actual data before committing.
- Context window — some embedding models max out at 512 tokens. Long chunks get silently truncated.
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=1024,  # smaller than max (3072) — good balance of quality vs. cost
    )
    return [r.embedding for r in response.data]
3. Vector Store
The vector store holds your embeddings and serves approximate nearest-neighbor (ANN) queries. Common options:
- Pinecone — Managed, low-ops, production scale
- Weaviate — Hybrid search built-in, self-hostable
- Qdrant — High-performance, self-hostable, rich filtering
- pgvector — Already on Postgres? Start here
- ChromaDB — Local dev and prototyping
For most production workloads: Qdrant or Weaviate if you self-host, Pinecone if you want zero infra overhead.
4. Retrieval
The retrieval step takes the user query, embeds it, and fetches the top-k most similar chunks. This is where most RAG systems silently fail.
Top-k is not a fixed constant. 3 is too few for multi-part questions. 20 is too many for a 4k context window. The right value depends on your chunk size, context window budget, and query complexity — and it should be tunable at runtime.
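As a rough sketch of what "tunable at runtime" can mean, you can derive k from the context budget and your average chunk size instead of hard-coding it; the numbers below are placeholders, not recommendations.

def dynamic_top_k(context_budget_tokens: int, avg_chunk_tokens: int,
                  min_k: int = 3, max_k: int = 20) -> int:
    # how many average-sized chunks fit in the tokens reserved for context
    fits = context_budget_tokens // max(avg_chunk_tokens, 1)
    return max(min_k, min(fits, max_k))

k = dynamic_top_k(context_budget_tokens=6000, avg_chunk_tokens=512)  # -> 11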
5. Generation
The LLM receives the query plus retrieved context and generates the final answer. Key decisions here:
- System prompt design — be explicit: "Answer only from the provided context. If the context does not contain the answer, say so."
- Context ordering — models tend to weight the beginning and end of context more than the middle ("lost in the middle" problem). Put your highest-confidence chunks first and last.
- Citation enforcement — instruct the model to cite the source document name or chunk ID. This surfaces retrieval errors immediately during eval.
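Put together, the prompt assembly might look like the sketch below. The chunk dictionaries (with score, source, and text fields) and the interleaving heuristic are assumptions for illustration, not a fixed recipe.

SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say so. "
    "Cite the [source] of every claim you make."
)

def order_for_position_bias(chunks: list[dict]) -> list[dict]:
    # counter "lost in the middle": strongest chunks go first and last, weakest end up in the middle
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    context = "\n\n".join(
        f"[source: {c['source']}]\n{c['text']}" for c in order_for_position_bias(chunks)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]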
Common Pitfalls and How to Avoid Them
Bad Chunking Strategies
The most common failure: chunks that cut across the very information the user is asking about. A table split in half, a code block severed at line 3, a numbered list where only items 1–4 land in the chunk.
Fix: inspect your chunks visually before indexing. Build a small script that prints 20 random chunks. If they look broken to you, they look broken to the retriever.
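That script can be as small as the following, assuming chunks is the list produced by the splitter earlier.

import random

for chunk in random.sample(chunks, k=min(20, len(chunks))):
    print("=" * 80)
    print(chunk.page_content)  # look for severed sentences, half tables, orphaned headings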
Poor Retrieval Quality
High cosine similarity does not mean high relevance. A query and a chunk can sit close together in embedding space while the chunk is completely off-topic for what the user actually needs.
Common causes:
- Query is too vague — the embedding model cannot distinguish what the user actually wants.
- Chunks are too long — the embedding averages over too much content, diluting the signal.
- Mismatch between embedding model training distribution and your corpus.
Context Window Mismanagement
LLMs have hard token limits. If your retrieval returns 20 chunks of 512 tokens each, that is 10,240 tokens of context before you count the system message, the user's question, or room for the answer. You will either hit the limit silently (truncation) or throw an error.
Track token counts explicitly at runtime:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_context(chunks: list[str], max_tokens: int = 6000) -> str:
    context_parts = []
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break
        context_parts.append(chunk)
        used += n
    return "\n\n---\n\n".join(context_parts)
Hallucinations Despite Retrieval
RAG reduces hallucinations; it does not eliminate them. The model can still confabulate details not present in the retrieved context, especially if:
- The retrieved chunks are only loosely relevant.
- The system prompt does not explicitly constrain generation to the context.
- The model is large and confident enough to override retrieved evidence.
Mitigation: add a faithfulness check post-generation (more on this in the evaluation section), and consider adding an explicit "no-answer" path when retrieval confidence is low.
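A minimal version of the no-answer path gates generation on the best retrieval score. The threshold, the hit shape (score and text fields), and the generic llm.complete call are placeholders you would adapt to your own stack and tune against a labelled eval set.

MIN_RETRIEVAL_SCORE = 0.35  # placeholder: tune against your eval set

def answer(question: str, hits: list[dict]) -> str:
    if not hits or max(h["score"] for h in hits) < MIN_RETRIEVAL_SCORE:
        return "I couldn't find this in the knowledge base."
    context = build_context([h["text"] for h in hits])  # token-budgeted helper from above
    return llm.complete(f"Context:\n{context}\n\nQuestion: {question}")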
Advanced Techniques That Move the Needle
These are the techniques that separate a demo from a production system.
Hybrid Search
Pure vector search misses exact keyword matches. Hybrid search combines dense retrieval (semantic) with sparse retrieval (BM25 / TF-IDF) and blends the scores — typically via Reciprocal Rank Fusion (RRF).
# Qdrant hybrid search example
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="my_collection",
    prefetch=[
        # dense (semantic) candidates
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        # sparse (keyword-style) candidates
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # blend the two lists with RRF
    limit=10,
)
Hybrid search reliably outperforms either approach alone, especially for product names, codes, and proper nouns that embedding models tend to blur together.
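If your store does not fuse results for you, or you just want to see what RRF actually does, the blend itself is only a few lines: each document scores the sum of 1 / (k + rank) over every result list it appears in, so documents ranked well by both dense and sparse retrieval float to the top. This is a generic sketch, not tied to any particular client.

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # result_lists: e.g. [dense_doc_ids, sparse_doc_ids], each ordered best-first
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)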
Re-ranking
Retrieve a large candidate set (top-20 or top-50), then run a cross-encoder re-ranker over the candidates to re-score them. Cross-encoders are slower than bi-encoders but far more accurate because they see the query and document together.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:top_k]]
Cohere's Rerank API offers a strong managed alternative if you do not want to host a model.
HyDE (Hypothetical Document Embeddings)
A clever trick: instead of embedding the user's raw query, ask the LLM to generate a hypothetical document that would answer the query, then embed that hypothetical document. Because it looks like an answer rather than a question, it sits closer in embedding space to your actual answer chunks.
def hyde_embed(query: str) -> list[float]:
    hypothetical_doc = llm.complete(
        f"Write a short paragraph that directly answers this question:\n{query}"
    )
    return embed([hypothetical_doc])[0]
HyDE is particularly effective for technical Q&A where queries are terse ("how does X work?") but documents are verbose.
Query Expansion and Decomposition
Multi-hop questions ("Which customers from the 2023 cohort upgraded to the enterprise plan?") are a single embedding query's nightmare. Decompose them:
sub_queries = llm.complete(
    f"Break this question into 2-4 simpler sub-questions:\n{query}"
)
# Run retrieval for each sub-query, deduplicate, then generate
Similarly, expanding a short query into 3–5 paraphrases and union-ing the results reduces sensitivity to exact wording.
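A sketch of that expansion step, reusing the generic llm.complete and embed helpers from earlier; the prompt wording, the generic vector_store.search call, and the id field used for deduplication are assumptions.

def expanded_retrieve(query: str, top_k: int = 10) -> list[dict]:
    paraphrases = llm.complete(
        f"Rewrite this question in 3 different ways, one per line:\n{query}"
    ).splitlines()
    seen, merged = set(), []
    for q in [query, *paraphrases]:
        for hit in vector_store.search(query_vector=embed([q])[0], top_k=top_k):
            if hit["id"] not in seen:  # union the per-paraphrase result sets
                seen.add(hit["id"])
                merged.append(hit)
    return merged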
Metadata Filtering
Do not retrieve from your entire corpus if you do not have to. Metadata filters (date range, document type, user permission scope, product area) drastically shrink the search space before vector similarity runs, improving both speed and relevance.
results = vector_store.search(
    query_vector=embed([query])[0],
    filter={"source": "support_tickets", "date": {"gte": "2025-01-01"}},
    top_k=10,
)
Contextual Compression
Return only the relevant sentences from each retrieved chunk, not the whole chunk. This keeps context tight and reduces the "lost in the middle" effect.
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
Evaluation Strategies
You cannot improve what you do not measure. Build an eval loop before you optimize anything.
RAGAS
RAGAS is the standard open-source framework for RAG evaluation. It scores four dimensions, most of which do not require hand-written ground-truth answers for every question:
- Faithfulness — Is the answer grounded in the retrieved context?
- Answer Relevancy — Does the answer address the question?
- Context Precision — Are the retrieved chunks actually relevant to the question?
- Context Recall — Did retrieval surface all the evidence needed?
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,  # HuggingFace Dataset with question/answer/contexts
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
LLM-as-Judge
For qualitative evaluation at scale, use a capable model (GPT-4o, Claude Opus) to score responses on correctness, completeness, and tone. Define a rubric, enforce it via structured output, and aggregate scores.
Note: LLM judges are consistent but not perfectly calibrated. Use them to detect regressions, not to report absolute accuracy numbers to stakeholders.
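One way to enforce the rubric is to ask the judge for structured JSON and parse it. The criteria, the scale, and the generic llm.complete call below are illustrative.

import json

JUDGE_PROMPT = """Score the answer from 1-5 on each criterion.
Respond with JSON only: {{"correctness": int, "completeness": int, "tone": int, "reason": str}}

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    raw = llm.complete(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)  # aggregate per-criterion scores across the golden question set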
Retrieval-Specific Metrics
Separately evaluate retrieval quality before generation quality. If retrieval is broken, there is no point optimizing prompts.
- Hit rate — does the correct document appear in the top-k?
- MRR (Mean Reciprocal Rank) — how high does the correct document rank?
- NDCG — weighted ranking quality across the full result set.
Build a labelled retrieval eval set of at least 50–100 queries with known relevant documents. Run it on every significant pipeline change.
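Hit rate and MRR are a few lines once that labelled set exists. This sketch assumes each eval item pairs a query with the ID of its known-relevant document, and that search results expose an id field.

def hit_rate_and_mrr(eval_set: list[dict], top_k: int = 10) -> tuple[float, float]:
    hits, rr_sum = 0, 0.0
    for item in eval_set:  # item: {"query": ..., "relevant_id": ...}
        results = vector_store.search(query_vector=embed([item["query"]])[0], top_k=top_k)
        ranked_ids = [r["id"] for r in results]
        if item["relevant_id"] in ranked_ids:
            hits += 1
            rr_sum += 1.0 / (ranked_ids.index(item["relevant_id"]) + 1)
    return hits / len(eval_set), rr_sum / len(eval_set)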
Architecture Decision Guide
Use this as a starting checklist when designing or auditing a RAG system.
Chunking
- Chunk strategy respects semantic boundaries (not just token count)
- Chunk size validated against your embedding model's context limit
- Parent-child chunking considered for long documents
- All chunks carry structured metadata
Retrieval
- Hybrid search (dense + sparse) in place for keyword-sensitive queries
- Re-ranker applied to candidate set before context assembly
- Metadata filters scoped to reduce search space where possible
- Top-k is dynamic and validated against available context budget
Generation
- System prompt explicitly restricts generation to retrieved context
- Token budget enforced before LLM call
- "I don't know" path present when retrieval confidence is low
- Citations or source references required in output
Evaluation
- Retrieval eval set (hit rate / MRR) in CI pipeline
- RAGAS or equivalent faithfulness metric tracked per release
- LLM-as-judge regression tests on golden question set
- Latency and cost tracked per query
Key Takeaways
- Chunking quality determines retrieval quality. Fix chunking before touching anything else.
- Hybrid search (dense + sparse + re-rank) is the current practical ceiling for retrieval quality without fine-tuning an embedding model.
- HyDE, query expansion, and contextual compression are high-leverage techniques with minimal infrastructure cost.
- Build your eval loop on day one. You cannot safely optimize a pipeline you cannot measure.
- Token budget management is operational hygiene, not an optimization — missing it causes silent failures.
RAG done well is not a single model call — it is a data pipeline with a language model at the end. Treat it like one.
Related Posts
- Why RAG beats fine-tuning for most use cases — The strategic case for RAG over retraining, and when fine-tuning is actually the right call.
- Building a production LLM pipeline in 2025 — Broader lessons on taking LLM features from demo to production, including chunking, eval loops, and cost control.
- Agent Reliability Blueprint: SLOs, Guardrails, and Human Override — When your RAG system is embedded in a larger agent, this is the reliability architecture it needs around it.