Every team picking an AI stack in 2026 asks the same question: which model do we use?
Most of them answer it the same way. They open the leaderboard, find the highest-ranked model they can afford, build a prototype against it, ship, then spend the next six months trying to claw back margin. By that point the architecture is frozen around the expensive choice.
That process is backwards. Here is a better one.
The Usual Decision Process, and Why It Fails
The typical model-selection flow looks like this: pick the most capable model you can demo well → get approval → discover cost and latency in production → optimize.
The problem is that "optimize" at that stage means swapping out a load-bearing dependency after the product is already running. You don't have evals — you built against a model that was good enough to skip them. You don't have a fallback path. You have a refactor.
The other failure mode is subtler: teams that start at GPT-5 never learn what their task actually requires. They don't know if 2B parameters would have worked, because they never tried. The model's capability masks their own ignorance of the problem.
The Framework: Start Small, Earn Big
The default starting position is the bottom rung. Earn your way up.
This is not a cost-optimization strategy. It is an information strategy.
Starting small forces you to write evals immediately — because you need a signal for when the small model fails. That eval set is the most valuable artifact your AI team will produce. Teams that start at GPT-5 never build it. They ship on vibes and call the model "good enough."
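For concreteness, here is a minimal sketch of what an eval case can look like. The two-field {"prompt", "expected"} format is an assumption chosen to match the harness in the next section, and the sentiment task is a placeholder; your cases should come from real traffic.

# Minimal eval set sketch. The {"prompt", "expected"} format is an assumption
# that matches the harness below; the sentiment task is a placeholder.
import json

eval_set = [
    {"prompt": "Classify the sentiment (positive or negative): 'The refund took three weeks.'",
     "expected": "negative"},
    {"prompt": "Classify the sentiment (positive or negative): 'Setup took two minutes.'",
     "expected": "positive"},
]

with open("eval_set.json", "w") as f:
    json.dump(eval_set, f, indent=2)

Even a few dozen labeled cases give you a real signal. Grow the set every time production surprises you.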
In 2026 this approach is credible in a way it wasn't two years ago. Phi-3 Mini (3.8B parameters) outperforms GPT-3.5 on most structured reasoning benchmarks. Gemma 2B runs on a phone with sub-100ms latency. The capability floor has risen far enough that "start small" is no longer a compromise — it is a reasonable default for most production workloads.
What "Graduating" Actually Looks Like
You need two things before you graduate a rung: an eval set, and a measurable failure on it.
# Benchmark harness: run the same task against two models, compare on labeled data
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


def run_eval(model: str, test_cases: list[dict]) -> dict:
    correct = 0
    for case in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        )
        prediction = response.choices[0].message.content.strip()
        if prediction == case["expected"]:
            correct += 1
    accuracy = correct / len(test_cases)
    return {"model": model, "accuracy": accuracy, "n": len(test_cases)}


with open("eval_set.json") as f:
    test_cases = json.load(f)

small = run_eval("phi3:mini", test_cases)
large = run_eval("llama3:8b", test_cases)
print(f"{small['model']}: {small['accuracy']:.2%}")
print(f"{large['model']}: {large['accuracy']:.2%}")
# Graduate only if delta justifies the latency and cost increase.
Run this before you touch a frontier model. If the small model hits your accuracy threshold, you are done. If it does not, you have a principled reason to climb — and a number to beat at the next rung.
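If you want the graduation decision to be mechanical rather than a judgment call, encode it. A minimal sketch, assuming an illustrative accuracy bar of 0.92 and the small/large result dicts from the harness above:

# Graduation rule sketch. The 0.92 bar is an illustrative assumption;
# set it from your product requirements, not from this post.
ACCURACY_THRESHOLD = 0.92

def should_graduate(small: dict, large: dict) -> bool:
    if small["accuracy"] >= ACCURACY_THRESHOLD:
        return False  # small model clears the bar: stay on this rung
    return large["accuracy"] >= ACCURACY_THRESHOLD  # climb only if the next rung actually passes

print("graduate" if should_graduate(small, large) else "stay")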
The Production Pattern: Route, Don't Replace
Once you accept that different requests have different complexity, you stop thinking in terms of one model per product and start thinking in terms of routing.
# Model router: try small model first, escalate only on low-confidence output
import re

from openai import OpenAI

SMALL_MODEL = "phi3:mini"
LARGE_MODEL = "llama3:8b"
CONFIDENCE_THRESHOLD = 0.85

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


def extract_confidence(text: str) -> float:
    """Parse a confidence score the model appends to its output."""
    match = re.search(r"confidence:\s*(0\.\d+|1\.0)", text, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0


def route(prompt: str) -> tuple[str, str]:
    system = "Answer the question. Append 'Confidence: <0.0-1.0>' on the last line."

    def call(model):
        return client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content.strip()

    response = call(SMALL_MODEL)
    if extract_confidence(response) >= CONFIDENCE_THRESHOLD:
        return response, SMALL_MODEL
    response = call(LARGE_MODEL)
    return response, LARGE_MODEL


answer, model_used = route("Summarize the key risks in this contract clause: ...")
print(f"[{model_used}] {answer}")
Note: Confidence self-reporting by LLMs is imperfect. Calibrate the threshold against your eval set, not intuition. This pattern works best when the small model's failure mode is silence or hedging rather than confident hallucination.
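One way to do that calibration, as a sketch: sweep candidate thresholds over the labeled eval set and read off the accuracy/escalation trade-off for each. This reuses client, SMALL_MODEL, LARGE_MODEL, and extract_confidence from the router above, plus the test_cases loaded by the harness; the answer parsing (first line of the response is the answer) is an assumption you should adjust to your prompt format.

# Threshold calibration sketch: for each candidate threshold, measure eval
# accuracy and how much traffic escalates to the large model. Assumes the
# answer is the first line of the response and the confidence is the last.
SYSTEM = "Answer the question. Append 'Confidence: <0.0-1.0>' on the last line."

def ask(model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()

def calibrate(test_cases: list[dict], thresholds=(0.70, 0.80, 0.85, 0.90)) -> None:
    for t in thresholds:
        escalated = correct = 0
        for case in test_cases:
            response = ask(SMALL_MODEL, case["prompt"])
            if extract_confidence(response) < t:
                escalated += 1
                response = ask(LARGE_MODEL, case["prompt"])
            answer = response.splitlines()[0].strip() if response else ""
            if answer == case["expected"]:
                correct += 1
        n = len(test_cases)
        print(f"threshold={t:.2f}  accuracy={correct / n:.2%}  escalated={escalated / n:.2%}")

calibrate(test_cases)

Pick the lowest threshold that still clears your accuracy bar; every point of escalation rate you shave off is traffic that never leaves the small model.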
The routing pattern is where the latency and privacy advantages of small models pay off. The small model handles 70–80% of your traffic locally, sub-100ms, with no data leaving your infrastructure. The large model sees only the hard cases. Regulated industries get data locality for free as a side effect of the architecture.
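A back-of-envelope view of what that split is worth. Every figure below is an illustrative assumption, not a measurement; plug in your own numbers:

# Blended cost and latency under routing. All figures are illustrative
# assumptions -- substitute measured numbers from your own stack.
small_share = 0.75                      # fraction of traffic the small model resolves
small_latency_ms, large_latency_ms = 80, 1200
small_cost, large_cost = 0.0, 0.004     # $/request; a local small model is ~$0 marginal

# Every request hits the small model first, so escalated requests pay for both calls.
blended_latency = small_latency_ms + (1 - small_share) * large_latency_ms
blended_cost = small_cost + (1 - small_share) * large_cost
print(f"blended latency ~ {blended_latency:.0f} ms, blended cost ~ ${blended_cost:.4f}/request")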
The One Rule
Start at the bottom. Climb only when your eval set forces you to. Every rung you do not climb is cost, latency, and complexity you do not carry.
The question is not "which model do we use?" The question is "what is the minimum model that passes our evals?" You cannot answer that without evals. So write the evals first.
Related Posts
- Stop Fine-Tuning GPT-5. A 7B Open-Source Model Will Beat It on Your Use Case — when fine-tuning a small model beats prompt-engineering a large one
- DeepSeek Changed Everything — why algorithmic efficiency matters more than raw compute
- Why RAG Beats Fine-Tuning for Most Use Cases — choosing the right adaptation strategy before choosing model size
- The State of AI Benchmarks in 2026 — why leaderboard rankings are the wrong input to model selection
Building a model selection framework at your org? The eval harness above is the starting point. What does your routing threshold look like in production? Drop it in the comments.