Won't adding context to every prompt increase our token bill?

It adds a few thousand tokens per run — and removes far more by cutting the back-and-forth the agent would otherwise spend reconstructing what changed. On most top-tier models the net is a lower cost per bug caught; on a few it is a small per-run increase that is still cheaper per outcome. Either way you come out ahead on the metric that matters: dollars per bug actually fixed.

How fast do we see ROI?

It pays for itself the first time it catches a bug that would have shipped. A single silent regression costs roughly $1,500 in engineering time to fix after release, and a customer-visible incident has been documented as high as $50,000 in lost revenue. The cost of running ReGrade is negligible by comparison — on most top-tier models it actually lowers total spend, cutting the cost per bug caught by 23–89%.

Does this lock us into one model or vendor?

No. The effect held across all twelve labs and every capability tier we tested, US and non-US. The behavioral diff sits alongside whatever coding agent and model you already use, so you can switch models without re-tooling.

How is this different from just setting a spend cap?

A cap limits how much work your team can do. This lowers the cost of the work itself — you get the same output for less, rather than less output for less. They are complementary, but only one of them improves quality at the same time.

The Most Expensive Thing Your AI Does Is Re-Read and Rework Its Own Code

Short answer for the budget owner: Your AI coding agents bill by the token, and a large share of that bill pays for the same work over and over — the agent re-reading files and re-running commands to reconstruct what just changed. The lever that actually bends the curve is not a cheaper model or a usage cap. It is giving the agent the answer up front. In a study spanning 28 models, 12 labs, and more than 1,700 trials, doing exactly that cut the cost per bug caught by 23–89% on most top-tier models — while catching more bugs, not fewer.

The bill is coming due

This spring the numbers stopped being abstract. TechCrunch's "The token bill comes due" opens with a CTO describing an engineer who "spent $40,000 on tokens last month." Per-developer consumption is up roughly 18.6× in nine months. Uber burned through its entire 2026 AI coding budget by April. One company woke up to a $500 million Claude bill after forgetting to set usage limits. Microsoft revoked its own developers' Claude Code licenses. Fortune's framing is blunter still: using the tech has become more expensive than paying human employees, with Nvidia's Bryan Catanzaro noting that for his team "the cost of compute is far beyond the costs of the employees."

This is not a handful of outliers. Goldman Sachs projects token consumption will multiply 24× by 2030, and Meta employees burned through 73.7 trillion tokens in a single month — racing up an internal leaderboard — until the runaway cost forced the company to start reining usage in. For a growing number of organizations, the AI coding bill is now the second-largest line on the engineering ledger, right behind salaries. And unlike salaries, nobody can tell you in advance what it will be.

Why agents are structurally expensive

A chatbot answers once. An agent works in a loop — and every call in that loop is stateless. The model remembers nothing between steps, so on each one the agent re-sends the entire history: the system prompt, the tool definitions, every previous result, all of its own reasoning. As LeanOps lays out, a five-step task already costs about three times as much as a single chat, a fifty-step task around thirty times as much, and a long autonomous run can cost more than a hundred times as much. That is how one developer's weekend refactor ran up $4,200 in API fees, and how a 35-person shop landed an $87,000 monthly bill. Vantage's teardown of agentic coding costs shows the same dynamic from the billing side: by turn 30, a session can be carrying 25,000–35,000 accumulated input tokens on every request.
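A minimal sketch of why the loop compounds, assuming a fixed prompt overhead and a fixed amount of new context appended at each step. Both figures are illustrative and are not taken from the sources cited here; the sketch also ignores prompt caching and output-token pricing, so it will not reproduce any particular vendor's multiples. What it does show is the shape: because every call re-sends everything so far, total input tokens grow roughly with the square of the step count.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens billed across a stateless agent loop.

    Each step re-sends the fixed overhead (system prompt, tool definitions)
    plus everything appended by the steps before it.
    """
    total = 0
    history = 0
    for _ in range(steps):
        total += base_tokens + history  # this call carries the whole history
        history += tokens_per_step      # and leaves more behind for the next call
    return total


# Illustrative assumptions: 2,000 tokens of fixed overhead, 1,000 new tokens per step.
BASE = 2_000
PER_STEP = 1_000

for n in (1, 5, 50):
    print(n, cumulative_input_tokens(n, BASE, PER_STEP))
```

Under these assumptions the step count rises 50-fold between the first and last rows, but the billed input rises far more than 50-fold, which is the dynamic the paragraph above describes.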

Here is the part that matters for the budget: a great many of those steps are the agent rediscovering context it could simply have been handed. It opens a file, reads it, runs a command, reads the output, opens another file — paying for tokens on every round trip — to answer one question: what is this code actually doing differently now? You are not just paying for the fix. You are paying, repeatedly, for the search that precedes it.

The instinct is to pull the wrong lever

When the bill spikes, the reflex is to cap usage or drop to a cheaper model. Both trade the token bill for something more expensive.

A usage cap throttles the very work your team adopted the tools to do. And a cheaper model ships more bugs — the costliest outcome of all, because a defect that reaches production costs far more than the tokens you saved getting it there. The savings are an illusion: you have moved the cost off the API invoice and into the incident channel, where it is larger and harder to trace.

The right lever: pay for outcomes, not rediscovery

The cheaper move is to stop paying the agent to rediscover what changed. Hand it a behavioral diff — a concise, structured account of how the new code's runtime behavior differs from the old — at the start, instead of making it reconstruct that picture one expensive tool call at a time.

That is what Curtail's ReGrade produces. It replays real traffic against the old and new versions of a service, compares what they actually do, and gives the reviewer — human or AI — the delta up front. The agent stops flailing and starts fixing.
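To make the idea concrete, here is a minimal sketch of a behavioral diff over two recorded responses to the same replayed request. It is an illustration only, not ReGrade's implementation or API: the function name `behavioral_diff`, the response layout, and the sample payloads are all hypothetical.

```python
from typing import Any


def behavioral_diff(old: dict[str, Any], new: dict[str, Any], path: str = "") -> list[str]:
    """Recursively list where two recorded responses differ."""
    diffs = []
    for key in sorted(set(old) | set(new)):
        where = f"{path}.{key}" if path else key
        if key not in new:
            diffs.append(f"removed {where}")
        elif key not in old:
            diffs.append(f"added {where}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are nested objects: descend and report inner changes.
            diffs.extend(behavioral_diff(old[key], new[key], where))
        elif old[key] != new[key]:
            diffs.append(f"changed {where}: {old[key]!r} -> {new[key]!r}")
    return diffs


# Hypothetical responses from the old and new service versions.
old_response = {
    "status": 404,
    "headers": {"content-type": "application/json"},
    "body": {"error": {"code": "not_found", "message": "no such user"}},
}
new_response = {
    "status": 404,
    "headers": {"content-type": "text/plain"},
    "body": {"error": "no such user"},
}

for line in behavioral_diff(old_response, new_response):
    print(line)
```

Both responses return the same status code, so a test that checks only the status would pass. The diff still reports two changes: the error envelope collapsed from an object to a string, and the content type drifted. A short list like this is the kind of delta an agent could be handed instead of reconstructing it through tool calls.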

What we measured

We tested this across 28 models from 12 labs, in 1,700+ agent-mode trials, each one tasked with fixing 18 planted regressions inside a 300,000-line codebase. With the behavioral diff added to the agent's prompt:

The agent reached its answer in 20–84% less wall-clock time per trial, with far fewer of those expensive round trips.
On most top-tier models, the cost per bug caught fell 23–89%.
On several models, token spend per trial actually dropped while bugs caught went up — cheaper and better at the same time.

And this isn't just propping up weak models. The cheaper agents, already burning tokens to flail, saved the most. But the most telling result was at the very top: even Anthropic's Mythos-class Fable 5, one of the most advanced models we tested, left bugs on the table on its own, and ReGrade took it to a perfect 18/18. No model is good enough to skip this.

Cheaper is only the start — it catches more, too

The efficiency is the budget story; the quality is why it pays off. Across the same lineup, the behavioral diff lifted bug-fix rates on every model in our core group by 1.7 to 13 additional bugs out of 18, taking the strongest models to a flawless 18/18. It also caught the defects conventional testing structurally can't: in our benchmark, 35% of AI patches that passed every functional test still silently broke behavior, and the diff caught, every time, the multi-file regressions that weaker models never caught on their own. It even sharpened the fixes: on a 291,000-line production codebase, nine of ten models pinpointed the bug more precisely with the diff in hand. Finding the bug is half the job; explaining it well enough that a human trusts the fix is the other half.

The predictability dividend

For anyone forecasting a budget, an unpredictable bill is worse than a high one. The behavioral diff helps here too. On several models (Sonnet 4.6, Opus 4.7, GPT-5.5, Qwen3.6-Plus), every trial produced the identical outcome once the agent had the diff in hand. Run-to-run variation in outcomes went to zero: same input, same result, every time.

That turns AI coding from a number you hope stays in range into a line item you can actually plan around — and it quietly kills the flaky-CI reruns that double the bill without anyone noticing.

The other side of the ledger

Token savings are only half the return. The bugs this catches are silent regressions — shifted response shapes, drifted headers, error envelopes that look right but are not — the kind that pass every test and surface in production. And AI-generated code ships more of them: CodeRabbit's analysis of real pull requests found AI-written code carries roughly 1.7× more defects than human-written code, and Lightrun's 2026 survey found 43% of AI-generated changes still need debugging in production after passing QA. Industry data puts the cost of fixing a bug after it ships at around 30× what it costs to catch it early — on the order of $1,500 in engineering time for a behavioral regression — while a single customer-visible incident has been documented at $50,000 in lost revenue. An agent that ships those bugs cheaply was never actually cheap.

The honest version

We are not promising a magic token diet. On a couple of models the per-trial cost actually ticks up slightly, because the diff rides along in cached context. The claim we will stand behind is cost per outcome: across the lineup, the cost of each bug you actually catch drops sharply, because the agent spends its tokens on the fix instead of the search. And this is not one vendor's quirk — the effect held across every lab and capability tier we tested, US and non-US alike.
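The distinction between cost per trial and cost per outcome is simple arithmetic, sketched below. The dollar amounts and bug counts are hypothetical, chosen only to show how a higher per-trial cost can still mean a lower cost per bug caught; they are not figures from the study.

```python
def cost_per_bug(cost_per_trial: float, bugs_caught: float) -> float:
    """Dollars spent per bug actually caught in a trial."""
    if bugs_caught <= 0:
        raise ValueError("no bugs caught, so cost per bug is undefined")
    return cost_per_trial / bugs_caught


# Hypothetical figures for one model, for illustration only.
baseline = cost_per_bug(cost_per_trial=2.00, bugs_caught=10)   # without the diff
with_diff = cost_per_bug(cost_per_trial=2.20, bugs_caught=16)  # per-trial cost up 10%

change = (with_diff - baseline) / baseline
print(f"baseline ${baseline:.4f}, with diff ${with_diff:.4f}, change {change:.0%}")
```

In this made-up case the per-trial cost rises by 10%, yet the cost per bug caught falls by about 31%, because the extra bugs caught outweigh the extra tokens spent.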

The takeaway

If AI coding spend has become a line item you are being asked to explain, the lever that controls it is the same one that improves quality: give your agents the answer instead of paying them to go find it. You spend less, you ship fewer bugs, and — for the first time — you can predict the bill.

The benchmark behind these numbers — DriftBench, our open-source evaluation suite — is public, the full study with all 28 models and per-model figures is available on request, and we run pilots against your own codebase.

Talk to us about a pilot: curtail.com

Sources & further reading

On the rising cost of AI coding

TechCrunch — The token bill comes due: inside the industry scramble to manage AI's runaway costs
Fortune — Microsoft reports are exposing AI's real cost problem
SmarterX — Uber, Microsoft, and others burning through AI budgets (with Goldman's 24× token projection)
The Decoder — Meta shifts from "tokenmaxxing" to token managing as internal AI costs hit billions
Vantage — The Hidden Cost Driver in Agentic Coding Sessions in 2026
LeanOps — AI Agents Burn 50× More Tokens Than Chats

On AI code quality and the cost of shipped bugs

CodeRabbit — 2025 was the year of AI speed; 2026 will be the year of AI quality
Lightrun via VentureBeat — 43% of AI-generated code changes need debugging in production
BetterQA — Cost of fixing bugs by SDLC stage: why production is 30×
TrackJS — The Hidden Cost of Silent API Failures in Production

The study behind ReGrade's numbers

DriftBench — open-source evaluation suite (GitHub)