The Research
Catch the bugs your tests can't see.
AI writes a third of new production code. The bugs it ships are the silent ones.
ReGrade pays for itself catching the first bug. It catches 4 to 13 extra bugs per trial across 16 of the 17 LLMs tested, at a marginal cost ranging from negative (it pays for itself) to at most 16¢ per extra bug caught — versus an industry-typical $1,500–$50,000 per bug that ships to production.
+4 to +13
extra bugs caught per trial, across 16 of 17 LLMs
+825%
biggest single-model improvement (Haiku 4.5)
18/18
bugs fixed by every top-tier model with ReGrade
σ→0
same answer every run on Sonnet, Opus, GPT-5.5, Qwen3.6-Plus
200
trials in our pre-registered head-to-head test
17 / 6
LLMs / AI providers tested
📊 The test, by the numbers.
ReGrade is a behavioral-diff context block your AI coding agent reads alongside its existing prompt — it shows the agent how the new code's runtime behavior differs from the old, the way a human checks for regressions at code review. No model retraining. No changes to your build or CI pipeline.
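ReGrade's actual block format isn't shown in this document, so the sketch below is only an illustration of the underlying behavioral-diff idea — run the old and new versions of the code on the same inputs and record where their runtime behavior diverges. All names here (`behavioral_diff`, `old_avg`, `new_avg`) are hypothetical, invented for this example.

```python
# Illustration only: ReGrade's real context-block format is not specified here.
# The core idea of a behavioral diff: exercise old and new code on identical
# inputs and report every input where their observable behavior differs.

def behavioral_diff(old_fn, new_fn, inputs):
    """Return the inputs (with old/new outcomes) where behavior diverges."""
    diffs = []
    for args in inputs:
        try:
            old_out = ("ok", old_fn(*args))
        except Exception as e:
            old_out = ("raised", type(e).__name__)
        try:
            new_out = ("ok", new_fn(*args))
        except Exception as e:
            new_out = ("raised", type(e).__name__)
        if old_out != new_out:
            diffs.append({"input": args, "old": old_out, "new": new_out})
    return diffs

# Hypothetical regression: a refactor silently switched to integer division.
def old_avg(xs):
    return sum(xs) / len(xs)

def new_avg(xs):
    return sum(xs) // len(xs)   # bug: floor division drops the fraction

report = behavioral_diff(old_avg, new_avg, [([1, 2],), ([1, 2, 3],)])
```

A silent bug like this passes many test inputs (`[1, 2, 3]` averages to 2 either way) but shows up immediately in the diff for `[1, 2]` — exactly the kind of signal a test suite alone can miss.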
What we tested. Each trial = one AI coding agent (running autonomously, the way Claude Code or Codex CLI operate — shell tools, file access, code edits) attempts to fix 18 known bugs planted in a 300,000-line Python codebase. 1,400+ trials across 17 LLMs from 6 AI providers.
17
LLMs
6
AI providers
1,400+
trials
18
known bugs
300,000-line Python
codebase
+825%
biggest gain
🛠️ Works with every major coding agent and model API.
3 agent CLIs tested · 17 LLMs across 6 providers.
Coding-agent CLIs
Claude Code
- Haiku 4.5
- Sonnet 4.6
- Opus 4.6
- Opus 4.7
Codex CLI
- GPT-5.4
- GPT-5.5
qwen-code
- Qwen3-Coder
- Qwen3-Coder-Plus
- Qwen3.6-Plus
- Qwen3.6-Max-Preview
Models tested via API
Gemini
- Gemini 3 Flash Preview
- Gemini 3.1 Pro Preview
Grok
- Grok 3 Mini
- Grok 3 Fast
- Grok 4.20 Reasoning
DeepSeek
- V4 Pro
- V4 Flash
One ReGrade context block in the agent's prompt. No model retraining. No changes to your build or CI pipeline. Tested working across all 6 AI providers.
🎯 Every model improves. No exceptions.
+33% to +825% bug-fix improvement across the lineup. Without ReGrade, the best model fixes only 13.3 / 18. With ReGrade, four models hit the ceiling and every other one moves up.
| Provider | Model | Without | With | Δ |
|---|---|---|---|---|
| Anthropic | Opus 4.7 | 9.1 / 18 | 18.0 / 18 | +98% |
| Anthropic | Sonnet 4.6 | 5.5 / 18 | 17.7 / 18 | +222% |
| OpenAI | GPT-5.5 | 12.3 / 18 | 18.0 / 18 | +46% |
| OpenAI | GPT-5.4 | 13.3 / 18 | 18.0 / 18 | +35% |
| DeepSeek | DeepSeek V4 Pro | 5.3 / 18 | 16.2 / 18 | +206% |
| xAI | Grok 4.20 Reasoning | 3.7 / 18 | 15.7 / 18 | +324% |
| Google | Gemini 3.1 Pro Preview | 8.7 / 18 | 14.3 / 18 | +64% |
| Alibaba | Qwen3.6-Max-Preview | 5.7 / 18 | 12.8 / 18 | +125% |
| Moonshot ★ | Kimi K2.6 (native CLI) | 6.2 / 18 | 16.1 / 18 | +160% |
Anthropic + OpenAI numbers from a length-controlled head-to-head extended to n=30 trials per model in v5.5. Other providers from the broader lineup test (n=3-4 trials per model). ★ Kimi K2.6 is a v6 preliminary cell at n=8 with engaged-subset means.
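The Δ column above is plain percent arithmetic over the Without/With means. As a sanity check (`pct_gain` is an illustrative helper, not part of ReGrade; figures taken from the table):

```python
# Reproduce the Δ column: percent improvement in mean bugs fixed,
# rounded to the nearest whole percent.
def pct_gain(without, with_regrade):
    return round(100 * (with_regrade - without) / without)

opus_gain = pct_gain(9.1, 18.0)    # Opus 4.7
sonnet_gain = pct_gain(5.5, 17.7)  # Sonnet 4.6
grok_gain = pct_gain(3.7, 15.7)    # Grok 4.20 Reasoning
```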
✅ The gains are real — proven, not assumed.
In a 600-trial head-to-head test whose design we registered in advance (so the results couldn't be cherry-picked after the fact), ReGrade beat all three comparison conditions — across every model tested.
| Control condition | Bugs fixed (control) | Bugs fixed (ReGrade) |
|---|---|---|
| No extra context (just the test suite) | 8.6 / 18 | 16.9 / 18 |
| Random gibberish, same length as ReGrade | 8.7 / 18 | 16.9 / 18 |
| Unrelated source code, same length as ReGrade | 8.4 / 18 | 16.9 / 18 |
What moves the needle is the behavioral signal in ReGrade, not the volume of extra context. There is less than a one-in-a-billion-trillion chance (p < 10⁻²¹) that the difference is a fluke.
⚡ Speed: 16–84% less wall-clock time per trial.
The ReGrade context cuts down on the back-and-forth where the agent runs shell commands and re-reads files trying to figure out what changed — the agent gets to the answer faster.
| Provider | Model | Without | With | Speed-up |
|---|---|---|---|---|
| Alibaba | Qwen3.6-Max-Preview | 50m 7s | 8m 4s | 84% faster |
| Google | Gemini 3.1 Pro Preview | 10m 22s | 6m 46s | 35% faster |
| OpenAI | GPT-5.4 | 4m 1s | 2m 48s | 30% faster |
| OpenAI | GPT-5.5 | 5m 17s | 4m 13s | 20% faster |
| Alibaba | Qwen3.6-Plus | 25m 26s | 21m 25s | 16% faster |
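The Speed-up column follows the same arithmetic: percent reduction in mean wall-clock time per trial, from the minute:second figures in the table (`seconds` and `speedup_pct` are illustrative helpers):

```python
# Reproduce the Speed-up column from the timing table above.
def seconds(m, s):
    return 60 * m + s

def speedup_pct(without_s, with_s):
    return round(100 * (1 - with_s / without_s))

qwen_max = speedup_pct(seconds(50, 7), seconds(8, 4))  # Qwen3.6-Max-Preview
gpt54 = speedup_pct(seconds(4, 1), seconds(2, 48))     # GPT-5.4
```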
Bonus: in four models (Sonnet 4.6, Opus 4.7, GPT-5.5, Qwen3.6-Plus) every trial produces the identical outcome — trial-to-trial randomness drops to zero. No more flaky CI retries.
💰 Cost per bug drops 23–89% on top-tier models.
Adding ReGrade reduces LLM input/output cost per trial and finds more bugs, so the total cost per bug fixed is meaningfully lower with ReGrade than without it on most top-tier models. Even on the four models where it doesn't pay for itself outright, the marginal cost stays at or below 16¢ per extra bug caught — versus an industry-typical $1,500–$50,000 per bug that ships to production.
| Provider | Model | Without $/bug | With $/bug | Change |
|---|---|---|---|---|
| Alibaba | Qwen3.6-Max-Preview | $0.333 | $0.036 | -89% |
| xAI | Grok 4.20 Reasoning | $0.235 | $0.077 | -67% |
| OpenAI | GPT-5.5 | $0.156 | $0.091 | -42% |
| Anthropic | Sonnet 4.6 | $0.126 | $0.092 | -27% |
| OpenAI | GPT-5.4 | $0.020 | $0.015 | -25% |
| Anthropic | Opus 4.7 | $0.257 | $0.199 | -23% |
Six of eight top-tier models get cheaper per bug with ReGrade, with reductions from −23% to −89%.
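To make the marginal-cost figure concrete: cost per extra bug divides the added per-trial spend by the added bugs fixed. The dollar amounts below are hypothetical (per-trial costs aren't listed in this report); the bug counts reuse the 8.6 → 16.9 means from the controlled-experiment table above.

```python
# Illustrative numbers only (not from the study): how marginal cost per
# extra bug is computed when extra context raises per-trial LLM spend
# but also raises the number of bugs fixed.
def marginal_cost_per_extra_bug(cost_without, cost_with,
                                bugs_without, bugs_with):
    return (cost_with - cost_without) / (bugs_with - bugs_without)

# Hypothetical trial costs of $0.60 -> $1.80; mean bugs fixed 8.6 -> 16.9.
example = marginal_cost_per_extra_bug(0.60, 1.80, 8.6, 16.9)  # $ per extra bug
```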
🏆 Three recommended models, ranked by cost with ReGrade.
For teams choosing their AI coding agent today: best results-per-dollar with ReGrade.
🥇
Gemini 3 Flash Preview
17.7 / 18
bugs fixed
$0.008
per bug
🥈
GPT-5.4
17.7 / 18
bugs fixed
$0.015
per bug
🥉
DeepSeek V4 Pro
16.2 / 18
bugs fixed
$0.017
per bug
Five more model + ReGrade combinations clear the cheap-AND-effective threshold. Works equally well across providers.
ReGrade is a single addition to the agent's prompt — you pay only the marginal LLM tokens for that context, with no separate ReGrade fee per bug.
Want ReGrade for your AI coding agent?
Drop the context block into your agent's prompt. No retraining. No CI changes.