
Why Gemini 3.5 Pro Got Delayed: The Real Cost of Long-Horizon Agent Tasks

Sundar Pichai promised Gemini 3.5 Pro for June 2026. It slipped to July because enterprise testers flagged excessive token burn on long agentic tasks. Simply explained: why that's a genuinely hard engineering problem, and why delaying to fix it was the right call.

July 3, 2026 · 7 min read


At Google I/O on May 19, 2026, Sundar Pichai told the world Gemini 3.5 Pro would ship in June. June came and went. No model.

It's now expected in July instead. Google didn't cite a capability gap, a safety review, or an infrastructure problem — the stated reason was that enterprise testers on extended agentic workloads flagged the model burning through too many tokens on long, multi-step tasks, and Google chose to fix that before shipping.

That sounds like a footnote. It isn't. "Uses too many tokens on long tasks" is a genuinely hard problem in production LLM engineering, and it has almost nothing to do with the model being chatty. Here's what's actually going on, and why holding a launch to fix it is a sign of attention to cost, not of a model in trouble.


What Google Actually Said

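The headlines compressed a few weeks into "Gemini 3.5 Pro delayed." The actual sequence is narrower: a June ship date announced at I/O on May 19, a June that passed without a release, and a new July target, with excessive token use on extended agentic workloads given as the reason.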

That's not "the model isn't smart enough." It's "the model costs too much for the tasks enterprises actually want it for" — different problems, and the second is arguably harder to fix, since you can't prompt your way out of it.


"Uses Too Many Tokens" Is Not a Vague Complaint

It's tempting to read "excessive token consumption" as a synonym for verbose — like the model writes longer paragraphs than it needs to. That's not what's being described. In an agent loop, token cost comes from three things stacking on top of each other:

None of these are bugs. They're the direct cost of giving a model enough context to act correctly — the problem is what happens once all three stack across a long-running task.

[Figure: Four connected cards showing what gets resent in every agent step: fixed system prompt and tool definitions, a growing transcript of the task so far, growing prior reasoning traces, and the new tool call and result. A loop-back arrow shows the transcript and reasoning are appended next step, not replaced.]

Two of these four blocks are fixed size. Two of them get bigger every single step, and never shrink back down.
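To make the four blocks concrete, here is a minimal sketch of how an agent loop's input might be assembled each step. Everything in it is an illustrative assumption rather than any vendor's API: the `AgentContext` class, the `rough_tokens` helper (a crude word count standing in for a real tokenizer), and the block sizes are all made up for the example.

```python
from dataclasses import dataclass, field


def rough_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class AgentContext:
    # Fixed-size block: system prompt plus tool definitions, resent every step.
    system_and_tools: str
    # Growing blocks: appended to after every step, never replaced.
    transcript: list[str] = field(default_factory=list)
    reasoning: list[str] = field(default_factory=list)

    def input_tokens(self, new_tool_result: str) -> int:
        """Tokens the model has to read for one step: all four blocks together."""
        return (
            rough_tokens(self.system_and_tools)
            + sum(rough_tokens(t) for t in self.transcript)
            + sum(rough_tokens(r) for r in self.reasoning)
            + rough_tokens(new_tool_result)
        )

    def finish_step(self, new_tool_result: str, new_reasoning: str) -> None:
        """The loop-back arrow: this step's output becomes next step's input."""
        self.transcript.append(new_tool_result)
        self.reasoning.append(new_reasoning)


# Illustrative sizes only: 200-token fixed block, 50-token tool results,
# 30-token reasoning traces.
ctx = AgentContext(system_and_tools="sys " * 200)
per_step = []
for step in range(1, 6):
    result = "out " * 50
    thought = "think " * 30
    per_step.append(ctx.input_tokens(result))
    ctx.finish_step(result, thought)

print(per_step)  # [250, 330, 410, 490, 570]
```

The fixed block and the new tool result contribute the same amount each step; the whole increase comes from the two lists that only ever get appended to.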


Why This Compounds Badly at Agent Scale

Here's what makes this a real production problem: in a single chat turn, a chatty model is a rounding error — a longer answer, a few hundred extra tokens. An agent loop is many calls, and because of the re-sent transcript, each one costs more than the last. A 10-step agent task doesn't cost 10x a chat turn — it costs more, because step 7 has to re-read everything steps 1 through 6 produced.

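Run the arithmetic. Say a chat turn processes around 1,000 tokens, and step 1 of an agent task costs about the same. If each step adds roughly 500 tokens of new transcript, step 2 reads 1,500 tokens, step 3 reads 2,000, and step 10 reads 5,500. A 10-step task therefore doesn't cost 10,000 tokens total; it costs closer to 32,500 (10 × 1,000, plus 500 × 45 for the re-read transcript), because the growing transcript gets billed again, in full, every step.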
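The same arithmetic can be checked in a few lines. This is a sketch under the simplifying assumptions used in the paragraph above (a 1,000-token first step, exactly 500 tokens of transcript added per step, no caching or truncation); the constant and function names are just labels for this example.

```python
BASE_TOKENS = 1_000     # tokens processed by step 1, about one chat turn
ADDED_PER_STEP = 500    # new transcript appended after each step
STEPS = 10


def step_cost(step: int) -> int:
    """Tokens read at a given step: the base plus everything appended so far."""
    return BASE_TOKENS + ADDED_PER_STEP * (step - 1)


per_step = [step_cost(s) for s in range(1, STEPS + 1)]
total = sum(per_step)
flat = BASE_TOKENS * STEPS  # what the task would cost if every step stayed flat

print(per_step[0], per_step[-1])  # 1000 5500
print(total, flat)                # 32500 10000
```

Step 10 alone reads 5,500 tokens, 5.5 times the first step, and the task as a whole costs 3.25 times the flat estimate.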

[Figure: Bar chart comparing the token cost of a single chat turn against 10 steps of an agent loop. Bars grow in height from step 1 through step 10 as the resent transcript accumulates, alongside a dashed reference line showing what cost would look like if it stayed flat per step.]

Step 1 costs about the same as a chat turn. Step 10 costs over five times as much: same model, same task type, just later in the loop.

Scale that to a real "extended agentic task" — 30, 50, sometimes hundreds of steps for a multi-file refactor or a long research run — and the gap between flat per-step cost and actual compounding cost becomes the line item that decides whether the workflow is affordable at all. It's the same cost math covered in more depth in Agent Reliability Blueprint, where token budgets are a first-class production constraint, not an afterthought.
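To see how quickly that gap opens, the running total can be written in closed form: n steps cost n × base + added × n(n − 1) / 2. The sketch below plugs in the illustrative 1,000-token base and 500 tokens added per step from the arithmetic earlier; real workloads will have different numbers, and prompt caching or transcript trimming would change the curve, so treat the output as a shape, not a forecast.

```python
def total_tokens(steps: int, base: int = 1_000, added_per_step: int = 500) -> int:
    """Total tokens read across a task when the transcript grows linearly."""
    # steps * (steps - 1) is always even, so integer division is exact.
    return steps * base + added_per_step * steps * (steps - 1) // 2


def overhead_ratio(steps: int, base: int = 1_000, added_per_step: int = 500) -> float:
    """Actual cost divided by the flat estimate of steps * base."""
    return total_tokens(steps, base, added_per_step) / (steps * base)


for n in (10, 30, 50, 100):
    print(n, total_tokens(n), overhead_ratio(n))
# 10 32500 3.25
# 30 247500 8.25
# 50 662500 13.25
# 100 2575000 25.75
```

Under these assumptions the overhead ratio grows linearly with step count, so the total grows quadratically: a 100-step run costs roughly 26 times what a flat per-step estimate would predict.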


Why Delaying the Launch Is the Right Call

It would have been easy to ship on the promised date and let customers discover the token math on their own invoice. By every indication the model was capable enough — what wasn't ready was the cost profile for long-horizon agentic tasks, the exact use case Google has pitched as its headline draw.

Holding the launch is a real trade-off: a news cycle, a window for competitors to make noise. But shipping a model that's capable yet economically unworkable for its flagship use case would cost more later, in trust, once the invoices landed. Eating bad headlines instead of shipping bill shock is the boring, correct call.

It's not happening in a vacuum. Through 2026, the industry has visibly shifted from pure capability-chasing — bigger benchmarks, more "tokenmaxxing" — toward efficiency. Teams have reportedly switched providers and models specifically to cut token overhead on agentic workloads, not because the old model got dumber, but because the cost-per-task math stopped working. Efficiency on long-horizon tasks is becoming a competitive axis of its own, not a footnote.


What This Means If You're Building on Gemini, or Anything Else

A few practical takeaways if you're running or evaluating agent workloads:


Key Takeaways

