RAG · LLM · AI

Building the Perfect RAG

Every RAG prototype works. Production is where pipelines break. A practical guide to chunking, retrieval, advanced techniques, and eval strategies that hold up under real load.

March 25, 2026 · 12 min read

Every RAG prototype works. The embedding model is fast, the top-5 results look relevant, and the LLM strings together a coherent answer. Then you ship it, and reality arrives: answers that ignore the retrieved context, retrievals that return the wrong chunk from the right document, and a context window that quietly overflows and silently drops evidence.

RAG is not a solved problem you plug in — it is a pipeline you engineer. This post is a practical guide to building one that actually holds up in production.


Table of Contents

  1. What RAG is (and isn't)
  2. The anatomy of a RAG pipeline
  3. Common pitfalls and how to avoid them
  4. Advanced techniques that move the needle
  5. Evaluation strategies
  6. Architecture decision guide

What RAG Is (and Isn't)

Retrieval-Augmented Generation is a pattern: retrieve relevant documents from an external store, inject them into the LLM's context, then generate a grounded response. That's it.

What it is not: a magic accuracy layer you bolt onto any model. RAG amplifies the quality of your retrieval. Garbage in, garbage out — just faster and with a citation.

The value proposition is real though. As covered in Why RAG beats fine-tuning for most use cases, RAG gives you up-to-date, maintainable knowledge without retraining — which is why it has become the default architecture for production knowledge bases, support bots, and document Q&A systems.


The Anatomy of a RAG Pipeline

A production RAG pipeline has five moving parts. Each one has failure modes.

1. Chunking

Chunking converts raw documents into indexable units. Get this wrong and every downstream step suffers — retrieval returns the wrong neighborhood, the LLM sees truncated context, and semantic coherence collapses.

Naive fixed-size chunking is a trap. Splitting every 512 tokens without regard for structure severs sentences, splits tables, and buries headings in the wrong chunk.

A better starting point is recursive splitting that respects document structure:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,           # ~25% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(docs)

Pro tip: Add document-level metadata to every chunk at index time — source URL, section title, creation date. You will need it for filtering and attribution later.
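
A minimal sketch of wiring that in with the splitter above; the file path and the metadata fields (source_url, section, created_at) are an illustrative schema, not a requirement:

from langchain.schema import Document

raw_text = open("setup_guide.md").read()

docs = [
    Document(
        page_content=raw_text,
        metadata={
            "source_url": "https://example.com/docs/setup",  # attribution
            "section": "Installation",                       # nearest heading
            "created_at": "2026-01-15",                      # enables date filters
        },
    )
]

# split_documents copies each parent document's metadata onto every chunk
chunks = splitter.split_documents(docs)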

2. Embedding

Each chunk gets converted into a dense vector that encodes its semantic meaning. Similarity search happens in this vector space.

Model choices matter. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source models like bge-large-en-v1.5 or nomic-embed-text all have different strengths. The key variables: embedding dimensionality, retrieval quality on your domain, cost, latency, and maximum input length.

from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=1024   # smaller than max (3072) — good balance of quality vs. cost
    )
    return [r.embedding for r in response.data]

3. Vector Store

The vector store holds your embeddings and serves approximate nearest-neighbor (ANN) queries. Common options span managed services (Pinecone), self-hostable engines (Qdrant, Weaviate, Milvus), and lighter-weight routes like pgvector or FAISS.

For most production workloads: Qdrant or Weaviate if you self-host, Pinecone if you want zero infra overhead.

4. Retrieval

The retrieval step takes the user query, embeds it, and fetches the top-k most similar chunks. This is where most RAG systems silently fail.

Top-k is not a fixed constant. 3 is too few for multi-part questions. 20 is too many for a 4k context window. The right value depends on your chunk size, context window budget, and query complexity — and it should be tunable at runtime.
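
To make that concrete, here is a sketch of a retrieval wrapper where k is a runtime parameter rather than a constant (the complexity heuristic is deliberately crude and purely illustrative; vector_store and embed are the components from earlier):

def retrieve(query: str, k: int | None = None) -> list:
    # Callers can pin k explicitly; otherwise fall back to a cheap
    # heuristic that gives multi-part questions a larger candidate set.
    if k is None:
        k = 12 if " and " in query.lower() else 5
    return vector_store.search(query_vector=embed([query])[0], top_k=k)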

5. Generation

The LLM receives the query plus retrieved context and generates the final answer. Key decisions here: where the context sits in the prompt, whether to require citations, what temperature to use, and what the model should say when the context does not contain the answer.


Common Pitfalls and How to Avoid Them

Bad Chunking Strategies

The most common failure: chunks that cut across the very information the user is asking about. A table split in half, a code block severed at line 3, a numbered list where only items 1–4 land in the chunk.

Fix: inspect your chunks visually before indexing. Build a small script that prints 20 random chunks. If they look broken to you, they look broken to the retriever.
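
A throwaway version of that script, assuming chunks is the list produced by the splitter earlier:

import random

# Eyeball a sample before indexing: severed tables, half code blocks,
# and orphaned headings are obvious to a human at a glance.
for chunk in random.sample(chunks, k=min(20, len(chunks))):
    print("=" * 60)
    print(chunk.page_content[:500])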

Poor Retrieval Quality

High cosine similarity does not mean high relevance. Two chunks can be semantically close in embedding space while one is completely off-topic in context.

Common causes: a vocabulary mismatch between queries and documents, an embedding model that has never seen your domain, chunks so large that the relevant sentence is drowned out, and an index polluted with boilerplate or near-duplicate content.

Context Window Mismanagement

LLMs have hard token limits. If your retrieval returns 20 chunks of 512 tokens each, that is 10,240 tokens of context before you count the system message, the question itself, or room for the answer. You will either hit the limit silently (truncation) or throw an error.

Track token counts explicitly at runtime:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_context(chunks: list[str], max_tokens: int = 6000) -> str:
    context_parts = []
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break
        context_parts.append(chunk)
        used += n
    return "\n\n---\n\n".join(context_parts)

Hallucinations Despite Retrieval

RAG reduces hallucinations; it does not eliminate them. The model can still confabulate details not present in the retrieved context, especially if the context is only partially relevant, the question demands a synthesis the context does not support, or the model's parametric knowledge contradicts the documents.

Mitigation: add a faithfulness check post-generation (more on this in the evaluation section), and consider adding an explicit "no-answer" path when retrieval confidence is low.
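
A sketch of that no-answer path; it assumes scored hits from your vector store, and the .score / .text attributes and the 0.45 cutoff are placeholders to tune against your own data:

NO_ANSWER = "I couldn't find this in the knowledge base."

def answer(query: str) -> str:
    hits = vector_store.search(query_vector=embed([query])[0], top_k=5)
    # Refuse rather than guess when even the best hit is only weakly related.
    if not hits or hits[0].score < 0.45:
        return NO_ANSWER
    context = build_context([h.text for h in hits])
    return llm.complete(
        f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    )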


Advanced Techniques That Move the Needle

These are the techniques that separate a demo from a production system.

Hybrid Search

Pure vector search misses exact keyword matches. Hybrid search combines dense retrieval (semantic) with sparse retrieval (BM25 / TF-IDF) and blends the scores — typically via Reciprocal Rank Fusion (RRF).

# Qdrant hybrid search example
from qdrant_client import QdrantClient
from qdrant_client.models import Fusion, FusionQuery, Prefetch, SparseVector

client = QdrantClient(url="http://localhost:6333")

# dense_vector, sparse_indices, and sparse_values come from your dense
# and sparse encoders; "dense" and "sparse" are named vectors on the collection
results = client.query_points(
    collection_name="my_collection",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(
            query=SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)

Hybrid search reliably outperforms either approach alone, especially for product names, codes, and proper nouns that embedding models tend to blur together.
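
If you need the fusion outside the database, RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the conventional constant.

from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([dense_result_ids, bm25_result_ids])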

Re-ranking

Retrieve a large candidate set (top-20 or top-50), then run a cross-encoder re-ranker over the candidates to re-score them. Cross-encoders are slower than bi-encoders but far more accurate because they see the query and document together.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

Cohere's Rerank API offers a strong managed alternative if you do not want to host a model.

HyDE (Hypothetical Document Embeddings)

A clever trick: instead of embedding the user's raw query, ask the LLM to generate a hypothetical document that would answer the query, then embed that hypothetical document. Because it looks like an answer rather than a question, it sits closer in embedding space to your actual answer chunks.

def hyde_embed(query: str) -> list[float]:
    # llm is any completion client; embed() is the helper defined earlier
    hypothetical_doc = llm.complete(
        f"Write a short paragraph that directly answers this question:\n{query}"
    )
    return embed([hypothetical_doc])[0]

HyDE is particularly effective for technical Q&A where queries are terse ("how does X work?") but documents are verbose.

Query Expansion and Decomposition

Multi-hop questions ("Which customers from the 2023 cohort upgraded to the enterprise plan?") are a single embedding query's nightmare. Decompose them:

sub_queries = llm.complete(
    f"Break this question into 2-4 simpler sub-questions, one per line:\n{query}"
).splitlines()
# Run retrieval for each sub-query, deduplicate, then generate

Similarly, expanding a short query into 3–5 paraphrases and union-ing the results reduces sensitivity to exact wording.
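
A sketch of the paraphrase-and-union variant; splitting the model's output on newlines assumes it returns one paraphrase per line, and hit.text is the same placeholder attribute as above:

def expanded_retrieve(query: str, top_k: int = 10) -> list[str]:
    paraphrases = llm.complete(
        f"Rewrite this question 3 different ways, one per line:\n{query}"
    ).splitlines()
    seen: dict[str, None] = {}
    for q in [query, *paraphrases]:
        for hit in vector_store.search(query_vector=embed([q])[0], top_k=top_k):
            seen.setdefault(hit.text, None)  # dedupe, keep first-seen order
    return list(seen)[:top_k]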

Metadata Filtering

Do not retrieve from your entire corpus if you do not have to. Metadata filters (date range, document type, user permission scope, product area) drastically shrink the search space before vector similarity runs, improving both speed and relevance.

results = vector_store.search(
    query_vector=embed([query])[0],   # embed() expects a list of texts
    filter={"source": "support_tickets", "date": {"gte": "2025-01-01"}},
    top_k=10
)

Contextual Compression

Return only the relevant sentences from each retrieved chunk, not the whole chunk. This keeps context tight and reduces the "lost in the middle" effect.

from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
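
In recent LangChain versions the compressed retriever is invoked like any other retriever; each returned document then contains only the sentences the extractor kept:

docs = compression_retriever.invoke(query)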

Evaluation Strategies

You cannot improve what you do not measure. Build an eval loop before you optimize anything.

RAGAS

RAGAS is the standard open-source framework for RAG evaluation. It measures four dimensions (faithfulness, answer relevancy, context precision, and context recall) without requiring ground-truth labels for every question:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,   # HuggingFace Dataset with question/answer/contexts
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(results)

LLM-as-Judge

For qualitative evaluation at scale, use a capable model (GPT-4o, Claude Opus) to score responses on correctness, completeness, and tone. Define a rubric, enforce it via structured output, and aggregate scores.

Note: LLM judges are consistent but not perfectly calibrated. Use them to detect regressions, not to report absolute accuracy numbers to stakeholders.
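
A minimal judge sketch; the rubric, the 1-5 scale, and the JSON shape are choices to adapt, and in practice you would pin temperature to 0 and retry on parse failures:

import json

JUDGE_PROMPT = """Score this answer from 1-5 on each criterion.
Respond with JSON only: {{"correctness": n, "completeness": n, "tone": n}}

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict[str, int]:
    raw = llm.complete(
        JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    )
    return json.loads(raw)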

Retrieval-Specific Metrics

Separately evaluate retrieval quality before generation quality. If retrieval is broken, there is no point optimizing prompts.

Build a labelled retrieval eval set of at least 50–100 queries with known relevant documents. Run it on every significant pipeline change.
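
Recall@k and MRR over that labelled set take only a few lines. Here eval_set pairs each query with the IDs of its known-relevant documents, and hit.id is a placeholder for however your store exposes document IDs:

def retrieval_metrics(eval_set: list[dict], k: int = 10) -> dict[str, float]:
    recall_hits, rr_total = 0, 0.0
    for case in eval_set:
        retrieved = [
            hit.id
            for hit in vector_store.search(
                query_vector=embed([case["query"]])[0], top_k=k
            )
        ]
        relevant = set(case["relevant_ids"])
        # Recall@k: did any known-relevant document make the top k?
        recall_hits += bool(set(retrieved) & relevant)
        # MRR: reciprocal rank of the first relevant hit, 0 if none appeared
        rr_total += next(
            (1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0
        )
    n = len(eval_set)
    return {"recall@k": recall_hits / n, "mrr": rr_total / n}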


Architecture Decision Guide

Use this as a starting checklist when designing or auditing a RAG system.

Chunking

  - Start with recursive splitting (~512 tokens, ~25% overlap) and tune from there.
  - Attach source, section, and date metadata to every chunk at index time.
  - Inspect a random sample of chunks before indexing anything.

Retrieval

  - Combine dense and sparse retrieval with RRF, then re-rank the candidates.
  - Make top-k tunable at runtime and filter by metadata before similarity runs.

Generation

  - Budget context tokens explicitly; never rely on silent truncation.
  - Define an explicit no-answer path for low-confidence retrievals.

Evaluation

  - Build a labelled retrieval eval set (50-100 queries) and run it on every significant change.
  - Track faithfulness and answer relevancy continuously, not just at launch.


Key Takeaways

RAG done well is not a single model call — it is a data pipeline with a language model at the end. Treat it like one.

