Why Gemini 3.5 Pro Got Delayed: The Real Cost of Long-Horizon Agent Tasks
At Google I/O on May 19, 2026, Sundar Pichai told the world Gemini 3.5 Pro would ship in June. June came and went. No model.
It's now expected in July instead. Google didn't cite a capability gap, a safety review, or an infrastructure problem — the stated reason was that enterprise testers on extended agentic workloads flagged the model burning through too many tokens on long, multi-step tasks, and Google chose to fix that before shipping.
That sounds like a footnote. It isn't. "Uses too many tokens on long tasks" is a genuinely hard problem in production LLM engineering, and it has almost nothing to do with the model writing long-winded answers. Here's what's actually going on, and why holding a launch to fix it is a sign of attention, not of a troubled model.
What Google Actually Said
The headlines compressed a few weeks into "Gemini 3.5 Pro delayed." The actual sequence is narrower:
- May 19, 2026 — At Google I/O, Pichai announces Gemini 3.5 Pro, targeting a June 2026 ship date.
- June 2026 — The promised date passes with no release.
- The stated reason — Enterprise testers on extended agentic workloads reported the model consuming more tokens than expected on long, multi-step tasks.
- New target — July 2026, with the extra weeks spent specifically on that token-consumption problem, not on new features.
That's not "the model isn't smart enough." It's "the model costs too much for the tasks enterprises actually want it for" — different problems, and the second is arguably harder to fix, since you can't prompt your way out of it.
"Uses Too Many Tokens" Is Not a Vague Complaint
It's tempting to read "excessive token consumption" as a synonym for verbose — like the model writes longer paragraphs than it needs to. That's not what's being described. In an agent loop, token cost comes from three things stacking on top of each other:
- Long tool-call chains — real work rarely finishes in one shot. The agent searches, reads a result, calls another tool, reads that result. Each round trip is a full model call.
- Repeated context re-processing — most agent frameworks resend the running transcript, every prior tool call and result, so the model has full history. That transcript never shrinks.
- Verbose reasoning carried forward — a model "thinking out loud" before acting, exactly what makes it good at multi-step tasks, adds that reasoning to the transcript too, billed again on every later call whether or not it's still relevant.
None of these are bugs. They're the direct cost of giving a model enough context to act correctly — the problem is what happens once all three stack across a long-running task.
[Figure: four blocks. Two of them are fixed size; the other two get bigger every single step and never shrink back down.]
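To make that growth concrete, here is a toy agent loop. It is a sketch under loud assumptions: every name in it is hypothetical, the "model" and "tool" are canned strings, and tokens are approximated by counting whitespace-separated words rather than by a real tokenizer. In this sketch the system prompt and task stay the same size, while reasoning and tool output pile up in the transcript.

```python
# Toy agent loop. All names are hypothetical, and "tokens" are
# approximated by whitespace-separated words, not a real tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())


def run_agent(system_prompt: str, task: str, steps: int) -> list[int]:
    """Return the approximate input size of each model call."""
    transcript: list[str] = []           # grows every step, never shrinks
    input_tokens_per_call: list[int] = []

    for step in range(1, steps + 1):
        # The whole history is resent on every call.
        prompt = "\n".join([system_prompt, task, *transcript])
        input_tokens_per_call.append(count_tokens(prompt))

        # Canned stand-ins for the model's reasoning and a tool result.
        reasoning = f"step {step} reasoning " + "think " * 20
        tool_result = f"step {step} tool output " + "data " * 30
        transcript.append(reasoning)
        transcript.append(tool_result)

    return input_tokens_per_call


costs = run_agent("You are an agent.", "Refactor the module.", steps=5)
print(costs)  # [7, 64, 121, 178, 235]
```

The fixed part (prompt plus task) is 7 words on every call. Everything above that is history being re-read.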
Why This Compounds Badly at Agent Scale
Here's what makes this a real production problem: in a single chat turn, a chatty model is a rounding error — a longer answer, a few hundred extra tokens. An agent loop is many calls, and because of the re-sent transcript, each one costs more than the last. A 10-step agent task doesn't cost 10x a chat turn — it costs more, because step 7 has to re-read everything steps 1 through 6 produced.
Run the arithmetic: say a chat turn processes around 1,000 tokens, and each step of a 10-step task starts from that same 1,000 and adds roughly 500 tokens of new transcript. The task doesn't cost 10,000 tokens total. It costs about 32,500 (10 × 1,000 for the base, plus 500 × 45 for the resent transcript), because the growing transcript gets billed again, in full, every step.
Step 1 costs about the same as a chat turn. Step 10 costs over five times as much — same model, same task type, just later in the loop.
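The same arithmetic as a few lines of Python, for anyone who wants to change the inputs. The 1,000 and 500 figures come from the text above. The assumption that step k processes the base turn plus everything the previous k - 1 steps appended is a simplification, and the variable names are illustrative.

```python
BASE_TOKENS = 1_000     # tokens in a single chat turn (from the text)
ADDED_PER_STEP = 500    # new transcript tokens each step adds (from the text)
STEPS = 10

# Assumption: step k processes the base turn plus everything the
# previous k - 1 steps appended to the transcript.
per_step = [BASE_TOKENS + ADDED_PER_STEP * (k - 1) for k in range(1, STEPS + 1)]

naive_total = BASE_TOKENS * STEPS
actual_total = sum(per_step)

print(per_step[0])    # 1000
print(per_step[-1])   # 5500
print(naive_total)    # 10000
print(actual_total)   # 32500
```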
Scale that to a real "extended agentic task" — 30, 50, sometimes hundreds of steps for a multi-file refactor or a long research run — and the gap between flat per-step cost and actual compounding cost becomes the line item that decides whether the workflow is affordable at all. It's the same cost math covered in more depth in Agent Reliability Blueprint, where token budgets are a first-class production constraint, not an afterthought.
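For a rough sense of how the gap widens with step count, the sum above has a closed form. This sketch reuses the same simplifying assumptions (a 1,000-token base, 500 new tokens per step, nothing pruned or discounted), so the outputs are illustrative rather than a forecast for any real model or price list. The function names are made up for this example.

```python
def total_tokens(steps: int, base: int = 1_000, added_per_step: int = 500) -> int:
    """Closed form of sum(base + added_per_step * (k - 1)) for k = 1..steps."""
    return steps * base + added_per_step * steps * (steps - 1) // 2


def overhead_ratio(steps: int, base: int = 1_000, added_per_step: int = 500) -> float:
    """How many times more the task costs than a flat `steps * base` estimate."""
    return total_tokens(steps, base, added_per_step) / (steps * base)


for n in (10, 30, 50, 200):
    print(n, total_tokens(n), overhead_ratio(n))
# 10 32500 3.25
# 30 247500 8.25
# 50 662500 13.25
# 200 10150000 50.75
```

Under these assumptions the flat estimate grows linearly with the number of steps while the real total grows quadratically, which is why the ratio keeps climbing.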
Why Delaying the Launch Is the Right Call
It would have been easy to ship on the promised date and let customers discover the token math on their own invoice. By every indication the model was capable enough — what wasn't ready was the cost profile for long-horizon agentic tasks, the exact use case Google has pitched as its headline draw.
Holding the launch is a real trade-off: a news cycle, a window for competitors to make noise. But shipping a model that's capable yet economically unworkable for its flagship use case would cost more later, in trust, once the invoices landed. Eating bad headlines instead of shipping bill shock is the boring, correct call.
It's not happening in a vacuum. Through 2026, the industry has visibly shifted from pure capability-chasing — bigger benchmarks, more "tokenmaxxing" — toward efficiency. Teams have reportedly switched providers and models specifically to cut token overhead on agentic workloads, not because the old model got dumber, but because the cost-per-task math stopped working. Efficiency on long-horizon tasks is becoming a competitive axis of its own, not a footnote.
What This Means If You're Building on Gemini, or Anything Else
A few practical takeaways if you're running or evaluating agent workloads:
- Measure cost per completed task, not per call — a cheap per-call price can still produce an expensive task once compounding re-sends are factored in.
- Audit your context window at each step. A framework that blindly appends the full transcript forever bills you for it, needed or not.
- Look for context pruning or summarization that collapses old tool output and reasoning once it's no longer load-bearing.
- Benchmark long-horizon cost specifically, not just short-turn latency — cheap at five turns can look different at 50.
- Treat a vendor delaying for cost reasons as a positive signal — it means someone looked at real usage data.
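The first three points can be sketched together: measure total input tokens per task, and compare a transcript that grows forever against one where older entries are collapsed to short stubs. Everything here is hypothetical, including the `Entry` type, the 20-token stub size, and the 200/300 split between reasoning and tool output. The sketch also treats summarization as free, which a real summarizer would not be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    kind: str      # "reasoning" or "tool_result"
    tokens: int    # size of this transcript entry


STUB_TOKENS = 20   # hypothetical size of a one-line summary of an old entry


def prune(transcript: list[Entry], keep_last: int) -> list[Entry]:
    """Keep the newest `keep_last` entries verbatim; shrink older ones to stubs."""
    cutoff = max(len(transcript) - keep_last, 0)
    old, recent = transcript[:cutoff], transcript[cutoff:]
    return [Entry(e.kind, min(e.tokens, STUB_TOKENS)) for e in old] + recent


def task_cost(steps: int, base: int = 1_000, reasoning: int = 200,
              tool: int = 300, keep_last: Optional[int] = None) -> int:
    """Total input tokens billed across one task, not per call."""
    transcript: list[Entry] = []
    total = 0
    for _ in range(steps):
        context = transcript if keep_last is None else prune(transcript, keep_last)
        total += base + sum(e.tokens for e in context)
        # Each step appends its reasoning and one tool result.
        transcript.append(Entry("reasoning", reasoning))
        transcript.append(Entry("tool_result", tool))
    return total


unpruned = task_cost(10)               # full transcript resent every step
pruned = task_cost(10, keep_last=4)    # only the last two steps kept verbatim
print(unpruned, pruned)  # 32500 19620
```

Whether pruning is safe depends on whether the collapsed entries were still load-bearing, so a per-task cost number is only meaningful alongside a check that the task still completes correctly.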
Key Takeaways
- Gemini 3.5 Pro was promised for June 2026 at Google I/O (May 19, 2026) and slipped to July because enterprise testers flagged excessive token consumption on extended agentic tasks — a cost problem, not a capability problem.
- "Too many tokens" comes from three stacking sources: long tool-call chains, repeated re-processing of the growing transcript, and verbose reasoning carried into every future call.
- A chatty model's extra output is a minor cost in one chat turn, but that cost compounds across a multi-step loop, since later steps re-read everything earlier steps produced.
- A 10-step agent task can cost several times more than a naive "10x a single call" estimate, purely from resent context that keeps growing.
- Delaying a launch to fix agent-scale token economics is a defensible engineering call, not evidence the model wasn't capable enough.
- The industry is shifting from capability-chasing toward efficiency on long-horizon agentic workloads — expect that to stay a differentiator through 2026.
Related Posts
- Agent Reliability Blueprint: SLOs, Guardrails, and Human Override — token budgets as a first-class production constraint for agents.
- Repo-Level AI Agents: How Coding Assistants Learned to Reason Across a Whole Codebase — the search → plan → act → verify loop, where every cycle adds to the bill.
- GPT-5.6's Three Tiers: Sol, Terra, and Luna, Simply Explained — a competing model family's own cost and tier tradeoffs.
- Stop Fine-Tuning GPT-5. A 7B Open-Source Model Will Beat It on Your Use Case — matching model size and cost to task difficulty.