From demo to production: the gap is real
Every LLM demo works. The data is curated, the queries are predictable, and the context window is never exceeded. Production is different.
Users ask unexpected questions. Documents are messy. The context window fills up. Costs spiral. And suddenly your 95% accuracy demo is a 70% accuracy embarrassment.
Here's what I've learned shipping LLM features to real users.
1. Chunking strategy matters more than model choice
Most teams spend time picking the right model. They should spend time picking the right chunking strategy.
Naive fixed-size chunking (every 500 tokens) breaks semantic units and destroys retrieval quality. Instead:
- Semantic chunking — split on paragraph or section boundaries
- Overlapping windows — 20% overlap between chunks to preserve context
- Hierarchical — store both summary chunks and detail chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on the largest separator that fits: paragraphs first, then
# lines, then sentences, then words. A 100-unit overlap on 512-unit
# chunks is roughly the 20% suggested above. Note: chunk_size is
# measured in characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder for token counts.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],
)
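The hierarchical variant can be sketched in plain Python: each section is stored twice, once as a summary chunk for broad retrieval and once as detail chunks for answering. The `summarize` function here is a placeholder (in practice you'd call a cheap LLM), and the names are illustrative, not from any library.

```python
def summarize(text: str, max_chars: int = 120) -> str:
    # Placeholder summary: first sentence, truncated.
    # In production, replace with a cheap LLM call.
    return text.split(". ")[0][:max_chars]

def hierarchical_chunks(sections: dict[str, str], detail_size: int = 512) -> list[dict]:
    """Store both a summary chunk and detail chunks per section."""
    chunks = []
    for title, body in sections.items():
        chunks.append({"level": "summary", "section": title, "text": summarize(body)})
        for i in range(0, len(body), detail_size):
            chunks.append({"level": "detail", "section": title, "text": body[i:i + detail_size]})
    return chunks
```

At query time, retrieve against summaries first, then pull the matching section's detail chunks.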
2. Build an eval loop before you optimize
You can't improve what you can't measure. Before tuning prompts or swapping models, build a simple eval:
- Collect 50–100 real user queries
- Write expected outputs (or use an LLM as judge)
- Score every pipeline change against this set
Without this, you're flying blind and optimizing for vibes.
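A minimal eval harness is a few lines. This is a sketch, not a framework: `pipeline` stands in for your retrieval chain, and the exact-match scorer is the simplest possible baseline; swap in an LLM-as-judge scorer where exact matching is too strict.

```python
def exact_match(expected: str, actual: str) -> float:
    # Simplest scorer: 1.0 on a case-insensitive exact match, else 0.0.
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_eval(cases: list[dict], pipeline, scorer=exact_match) -> float:
    """Run every stored query through the pipeline and return the mean score."""
    scores = [scorer(case["expected"], pipeline(case["query"])) for case in cases]
    return sum(scores) / len(scores)
```

Re-run this after every prompt tweak or model swap; a change that doesn't move the score didn't help.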
3. Cost control from day one
GPT-4 at scale is expensive. Structure your pipeline to use expensive models sparingly:
- Route simple queries to a cheap model (Haiku, Flash)
- Reserve expensive models for complex reasoning
- Cache frequent queries — most production traffic repeats
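The routing-plus-cache pattern can be sketched as follows. The length heuristic and model names are illustrative only; real routers usually use a cheap classifier call, and `call_llm` is a stub for your actual client.

```python
from functools import lru_cache

def pick_model(query: str) -> str:
    # Crude illustrative heuristic: long or multi-part questions
    # go to the expensive model, everything else goes cheap.
    is_complex = len(query) > 200 or query.count("?") > 1
    return "expensive-model" if is_complex else "cheap-model"

def call_llm(model: str, query: str) -> str:
    # Stub standing in for a real LLM client call.
    return f"[{model}] answer to: {query}"

@lru_cache(maxsize=10_000)  # repeated queries are served from cache, costing nothing
def answer(query: str) -> str:
    return call_llm(pick_model(query), query)
```

An in-process `lru_cache` only helps within one server; for real traffic you'd back this with Redis or similar, keyed on a normalized query.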
4. Observability is non-negotiable
Log every LLM call: input, output, latency, token count, cost. Use LangSmith, Langfuse, or a custom solution. When something breaks at 3am, you'll thank yourself.
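If you roll your own, a thin wrapper is enough to start. This sketch emits one structured JSON log line per call; the token estimate and the per-token price are rough placeholders, not real model pricing.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def logged_call(model: str, prompt: str, llm_fn, usd_per_1k_tokens: float = 0.001) -> str:
    """Call llm_fn(model, prompt) and log input, output, latency, tokens, cost."""
    start = time.perf_counter()
    output = llm_fn(model, prompt)
    tokens = (len(prompt) + len(output)) // 4  # crude ~4-chars-per-token estimate
    record = {
        "model": model,
        "prompt": prompt[:200],   # truncate to keep logs bounded
        "output": output[:200],
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "tokens": tokens,
        "cost_usd": round(tokens / 1000 * usd_per_1k_tokens, 6),
    }
    log.info(json.dumps(record))
    return output
```

Structured JSON lines feed straight into whatever log aggregator you already run, which is often enough before reaching for a dedicated tool.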
Shipping is the start, not the finish.