Curtail
Skyler Lister-Aley

How to Catch AI-Generated Code Regressions Before They Ship

AI-generated code is exceptionally good at passing tests and exceptionally bad at preserving behavior tests don't assert against. This is how to catch the runtime regressions your CI is silently letting through.

Tags: ai-code, regression-testing, ci, runtime-verification, regrade

Short answer: Record the HTTP/API traffic your existing tests already generate, replay that same traffic against the old and new versions of your service, and compare every field of every response. Any difference (status code, body field, header, latency, downstream call) is a behavioral change worth investigating, and a regression unless someone intended it. This is called runtime verification, and it catches the class of bug AI-generated code produces most often: changes that compile, pass tests, and silently alter what your API actually does.


The specific failure mode that AI-generated code keeps producing

Here is a real shape of bug we see again and again. A developer asks Copilot or Cursor to "clean up" a list endpoint. The before:

@router.get("/users")
def list_users(active_only: bool = False) -> dict:
    users = db.users.find()
    if active_only:
        users = [u for u in users if u.is_active]
    return {"users": [u.to_dict() for u in users], "count": len(users)}

The AI rewrites it for "consistency with REST conventions":

@router.get("/users")
def list_users(active_only: bool = False) -> list[dict]:
    users = db.users.find()
    if active_only:
        users = [u for u in users if u.is_active]
    return [u.to_dict() for u in users]

The change is defensible. Many style guides prefer bare arrays for collection endpoints. The tests pass because they assert against the function's return value, and the AI updated those assertions to match the new shape. The pull request looks clean.

But this endpoint has 14 downstream consumers, and three of them parse response.json()["users"]. Against the new shape, those three will raise TypeError the first time they run, because a list cannot be indexed with a string key. The CI pipeline has no idea. Nothing it can see, at the layer it's looking, is wrong.
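To make the failure concrete, here is a minimal sketch of what one of those consumers experiences. The `count_users` helper and the sample payloads are hypothetical, standing in for any client that indexes the response by key.

```python
import json

def count_users(raw_body: str) -> int:
    """Hypothetical consumer: parses the response and reads the "users" key."""
    payload = json.loads(raw_body)
    return len(payload["users"])

# Response bodies before and after the refactor (sample data, not real traffic).
old_body = json.dumps({"users": [{"id": 1}, {"id": 2}], "count": 2})
new_body = json.dumps([{"id": 1}, {"id": 2}])

print(count_users(old_body))  # works against the old shape

try:
    count_users(new_body)
except TypeError as exc:
    # A list indexed with a string raises TypeError
    # (a dict that merely lacked the key would raise KeyError instead).
    print(f"consumer broke: {exc}")
```

Nothing in the producer's own repository runs this code, which is why the producer's CI stays green.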

This is the gap. AI-generated code is excellent at producing code that compiles and passes assertions written against the previous version of itself. It is uniquely bad at preserving the parts of behavior nobody thought to write an assertion for — and consumers of your API depend on a much larger surface than your assertions cover.

Why your existing safety net doesn't catch this

Five common defenses, none of which catches the bug above:

Unit tests assert what the author thought to assert. If the new test asserts the new shape and the old test was deleted (which the AI helpfully did, because "test was outdated"), there is no remaining check on the original behavior.

Type checkers see the new return type and the new function signature, which are internally consistent. The fact that consumers in three other repos expected a different shape is not visible to the type checker: it can only see the code in the repository it is run against.

Integration tests assert at the integration boundary, but the boundary they assert against was usually defined by the same team that owns the endpoint. Edge fields, header shapes, ordering, and error formats are routinely under-asserted.

SAST / SCA tools look for known vulnerability patterns and dependency CVEs. A shape-change refactor produces no signal here. There is no CVE for "developer changed an API response shape."

Manual code review is the actual safety net that catches most of these — until it doesn't. AI-generated refactors are arriving faster than human reviewers can think about every downstream consumer of every endpoint.

The gap is not in any of these tools individually. The gap is that all of them verify what was thought to be true. None of them verify what the system actually does on real traffic.

What runtime verification is, mechanically

Runtime verification works at the network request/response layer. It does three things:

  1. Records real HTTP/API traffic from any source you already have — your existing test suite, a security scanner, a staging environment, or production. Production traffic is not required. Whatever traffic you already generate is enough to start.
  2. Replays that recorded traffic against two versions of your service: the candidate build (the one with the AI-generated change) and the baseline (the version currently in production).
  3. Compares the two response sets field by field. Status code, body fields, headers, latency, downstream calls. Every difference is surfaced. Nothing is filtered by "what someone thought to assert."

The output is a field-level diff between two versions of your service. The class of bugs this catches has nothing to do with whether anyone predicted them. It depends only on whether the response actually changed.
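The comparison step can be sketched in a few lines. This is an illustrative recursive diff over parsed JSON, not ReGrade's implementation; the function name `diff_fields` and the output format are assumptions made for the example.

```python
from typing import Any

def diff_fields(baseline: Any, candidate: Any, path: str = "body") -> list[str]:
    """Return human-readable differences between two parsed JSON values."""
    # A change of JSON type (object vs. array, string vs. number) is reported as-is.
    if type(baseline) is not type(candidate):
        return [f"{path}: type {type(baseline).__name__} -> {type(candidate).__name__}"]
    if isinstance(baseline, dict):
        diffs: list[str] = []
        for key in sorted(baseline.keys() | candidate.keys()):
            child = f"{path}.{key}"
            if key not in candidate:
                diffs.append(f"{child}: removed")
            elif key not in baseline:
                diffs.append(f"{child}: added")
            else:
                diffs.extend(diff_fields(baseline[key], candidate[key], child))
        return diffs
    if isinstance(baseline, list):
        diffs = []
        if len(baseline) != len(candidate):
            diffs.append(f"{path}: length {len(baseline)} -> {len(candidate)}")
        # Compare elements position by position up to the shorter length.
        for i, (b, c) in enumerate(zip(baseline, candidate)):
            diffs.extend(diff_fields(b, c, f"{path}[{i}]"))
        return diffs
    if baseline != candidate:
        return [f"{path}: {baseline!r} -> {candidate!r}"]
    return []

old = {"users": [{"id": 1}], "count": 1}
new = [{"id": 1}]
print(diff_fields(old, new))  # ['body: type dict -> list']
```

The point of the design is that the function has no notion of which fields matter. It reports whatever changed, and the decision about relevance happens afterwards.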

For the bug above, the diff would surface immediately:

GET /users (active_only=false)
  status:   200 → 200
- body:     {"users": [...], "count": 14}
+ body:     [...]

  type:     object → array
  fields removed: users, count

This is what your CI gate should be reading before it decides whether to merge.

Where this fits in your pipeline

The natural insertion point is between candidate build and promotion. Concretely:

developer (or AI agent) ships code
     ↓
CI builds and runs the existing test suite
     ↓
CI cuts a candidate image
     ↓
runtime verification replays recorded traffic against baseline + candidate
     ↓
field-level diffs surface; CI annotates the PR with every difference
     ↓
human or policy decides: ship | block | investigate

Tests run first; they catch what they're going to catch. Runtime verification runs second; it catches what tests structurally cannot. The two are complementary: runtime verification does not replace your test suite; it adds the verification layer your test suite was never designed to provide.

For most teams, the right mode is "fail the build if any unexpected field-level difference appears, with an annotation explaining what changed, and a one-click approval path for differences the developer intentionally introduced." This puts the developer in the position of confirming behavioral changes, instead of being surprised by them in production.
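As a rough sketch of that policy, the gate reduces to a set difference between detected changes and changes the developer has explicitly approved. The `gate` function, the string format of each diff, and the idea of storing approvals as a set are all assumptions for illustration, not a description of any particular tool.

```python
import sys

def gate(diffs: list[str], approved: set[str]) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    unexpected = [d for d in diffs if d not in approved]
    for d in unexpected:
        # Annotate the build log with each change nobody signed off on.
        print(f"unexpected change: {d}", file=sys.stderr)
    return 1 if unexpected else 0

diffs = [
    "GET /users body: type dict -> list",
    "GET /users header X-Trace: added",
]
approved = {"GET /users header X-Trace: added"}  # developer accepted this one

exit_code = gate(diffs, approved)
print(exit_code)  # 1: the shape change was not approved
```

In a CI job the returned value would be passed to `sys.exit`, so an unapproved difference fails the build while an approved one does not.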

A walkthrough with ReGrade

ReGrade is the tool we build for this. The workflow on a refactor like the one above is:

Step 1 — capture baseline traffic. Run your existing test suite once against the baseline build, with ReGrade's sensor recording. The sensor produces a traffic file. If you have 800 tests that hit HTTP endpoints, you now have 800+ recorded request/response pairs as the baseline. No extra test writing.
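Conceptually, recording amounts to keeping every request/response pair the tests produce. The sketch below illustrates the idea with a plain Python wrapper; it is not ReGrade's sensor, and the `recording` helper, the `fake_baseline` service, and the one-JSON-object-per-line layout are assumptions made for the example.

```python
import json
from typing import Callable

Request = dict   # e.g. {"method": "GET", "path": "/users"}
Response = dict  # e.g. {"status": 200, "body": ...}

def recording(send: Callable[[Request], Response], log: list[dict]) -> Callable[[Request], Response]:
    """Wrap a request function so every request/response pair is appended to `log`."""
    def wrapper(request: Request) -> Response:
        response = send(request)
        log.append({"request": request, "response": response})
        return response
    return wrapper

def fake_baseline(request: Request) -> Response:
    """Stand-in for the baseline service; a real setup would send HTTP."""
    return {"status": 200, "body": {"users": [], "count": 0}}

traffic: list[dict] = []
client = recording(fake_baseline, traffic)
client({"method": "GET", "path": "/users"})  # what an existing test would do

# One JSON object per line is a convenient, assumed on-disk format.
traffic_file = "\n".join(json.dumps(pair) for pair in traffic)
print(traffic_file)
```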

Step 2 — replay against the candidate. When the AI-refactored PR opens, CI builds the candidate image and ReGrade replays the recorded traffic against it.
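The replay step can be pictured as a loop over the recorded pairs. This is a simplified sketch under stated assumptions: `replay`, `fake_candidate`, and the dictionary layout of each recorded pair are hypothetical, and a real replay would issue HTTP requests against the candidate image.

```python
from typing import Callable

def replay(recorded: list[dict], candidate: Callable[[dict], dict]) -> list[dict]:
    """Re-send each recorded request to the candidate; keep pairs whose responses differ."""
    mismatches = []
    for pair in recorded:
        new_response = candidate(pair["request"])
        if new_response != pair["response"]:
            mismatches.append({
                "request": pair["request"],
                "baseline": pair["response"],
                "candidate": new_response,
            })
    return mismatches

# Sample baseline recording (hypothetical data).
recorded = [
    {"request": {"method": "GET", "path": "/users"},
     "response": {"status": 200, "body": {"users": [], "count": 0}}},
    {"request": {"method": "GET", "path": "/health"},
     "response": {"status": 200, "body": {"ok": True}}},
]

def fake_candidate(request: dict) -> dict:
    """Stand-in for the refactored build: /users now returns a bare array."""
    if request["path"] == "/users":
        return {"status": 200, "body": []}
    return {"status": 200, "body": {"ok": True}}

mismatches = replay(recorded, fake_candidate)
for m in mismatches:
    print(m["request"]["path"], m["baseline"]["body"], "->", m["candidate"]["body"])
```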

Step 3: diff and report. ReGrade emits a field-level diff between baseline and candidate responses. Every difference. If the AI's refactor changed {"users": [...], "count": N} to [...], the diff shows it as a top-level shape change. If a header changed casing, the diff shows it. If 95th-percentile latency drifted, the diff shows it.
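A latency comparison of this kind might look like the following sketch. The 20% tolerance, the function names, and the sample timings are assumptions chosen for illustration; how ReGrade itself measures drift is not specified here.

```python
import statistics

def p95(samples_ms: list[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]

def latency_drifted(baseline_ms: list[float], candidate_ms: list[float],
                    tolerance: float = 0.20) -> bool:
    """True if the candidate's p95 exceeds the baseline's by more than `tolerance`."""
    return p95(candidate_ms) > p95(baseline_ms) * (1 + tolerance)

# Hypothetical timings: the candidate has a slow tail the baseline lacks.
baseline_ms = [10.0] * 100
candidate_ms = [10.0] * 90 + [50.0] * 10

print(latency_drifted(baseline_ms, candidate_ms))  # True
print(latency_drifted(baseline_ms, baseline_ms))   # False
```

A tolerance is needed because two runs of identical code rarely produce identical timings, so an exact comparison would flag noise.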

Step 4 — annotate the PR. ReGrade posts a CI comment on the PR enumerating every detected change, grouped by endpoint, with the full request/response context. The developer sees exactly what changed, and either accepts the change (explicit, in-PR) or fixes the regression.
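Grouping changes by endpoint before posting them is straightforward. The sketch below shows one possible rendering; the `render_comment` function and its text layout are hypothetical and do not reproduce ReGrade's actual comment format.

```python
from collections import defaultdict

def render_comment(changes: list[tuple[str, str]]) -> str:
    """Group (endpoint, change) pairs by endpoint and render a PR comment body."""
    by_endpoint: dict[str, list[str]] = defaultdict(list)
    for endpoint, change in changes:
        by_endpoint[endpoint].append(change)

    lines = [f"Runtime verification found {len(changes)} change(s):"]
    for endpoint in sorted(by_endpoint):
        lines.append(f"\n{endpoint}")
        lines.extend(f"  - {change}" for change in by_endpoint[endpoint])
    return "\n".join(lines)

comment = render_comment([
    ("GET /users", "body: type dict -> list"),
    ("GET /orders", "header X-Total: removed"),
    ("GET /users", "body.count: removed"),
])
print(comment)
```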

The whole loop takes the time of one test run plus the replay time, and runs once per PR.

When this matters most

Three situations where runtime verification has the highest payoff:

Refactoring under AI assistance. Refactors are supposed to be behavior-preserving by definition. AI tools are excellent at producing refactors that look clean but change behavior at edges nobody asserted against. Runtime verification is the inverse of "trust the AI": you do not need to trust it, because every behavioral change exercised by the recorded traffic shows up in the diff.

Anywhere you have downstream consumers you don't control. Internal microservices, public APIs, partner integrations. Tests in your repo cannot see what your consumers depend on. Runtime verification compares responses themselves, so it catches changes that affect any consumer.

Migrating between major versions of frameworks or libraries. A framework upgrade often changes serialization behavior, header defaults, or error formats in ways the changelog under-documents. Replaying recorded traffic against the upgraded build surfaces every such change as a diff.

What runtime verification doesn't do

To be honest about scope:

  • It does not catch bugs that affect behavior you didn't record traffic for. If a code path is never exercised by any test or scanner, it will never appear in the diff.
  • It does not replace human judgment. Some diffs are intentional and correct; the tool surfaces them, the human decides.
  • It does not detect performance regressions you didn't measure. Latency diffs require the replay environment to be reasonably representative of production.
  • It is not a security scanner. It will surface a security-relevant change (e.g., an auth check that used to fire but doesn't anymore) only if recorded traffic exercises that code path. It complements SAST/DAST; it does not replace them.
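The first limitation can at least be made visible. One possible check, sketched below, lists the routes that no recorded request ever touched. The `unrecorded_routes` helper and the route list are hypothetical, and the exact-match comparison is a simplification: parameterized paths such as /users/{id} would need pattern matching.

```python
def unrecorded_routes(routes: list[tuple[str, str]], recorded: list[dict]) -> list[tuple[str, str]]:
    """Return the (method, path) routes that never appear in the recorded traffic."""
    seen = {(pair["request"]["method"], pair["request"]["path"]) for pair in recorded}
    return [route for route in routes if route not in seen]

# Hypothetical route table and recording.
routes = [("GET", "/users"), ("POST", "/users"), ("GET", "/health")]
recorded = [
    {"request": {"method": "GET", "path": "/users"}, "response": {"status": 200}},
    {"request": {"method": "GET", "path": "/health"}, "response": {"status": 200}},
]

print(unrecorded_routes(routes, recorded))  # [('POST', '/users')]
```

Any route in that list is one where a behavioral change would pass through the diff unnoticed.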

The right framing is: runtime verification adds a layer of evidence to your CI pipeline that did not exist before, specifically for the class of regression that tests structurally miss. That layer is not free, and it is not magic. It is mechanical: record, replay, diff.

Getting started

If you want to try this approach in your own pipeline, ReGrade publishes a free template repository that wires the recording, replay, and diff into a GitHub Actions workflow against a sample API. Clone it, point it at your own service, and you can have field-level runtime verification running on PRs in under an hour.


The template uses ReGrade's hosted API for replay and diff. Self-hosting is also available for organizations that need traffic to stay inside their own network.

Frequently asked questions

How is runtime verification different from integration testing?

Integration tests verify what the author thought to assert at the integration boundary. Runtime verification compares every observable field of every response between two versions of the service, without depending on someone having predicted what to test. Integration tests catch the assertions you wrote; runtime verification catches the changes you didn't anticipate.

Do I need production traffic to use this?

No. ReGrade can record traffic from your existing test suite, a security scanner, a staging environment, or any other source that generates HTTP/API traffic against your service. Production traffic works if you have it, but is not required.

Won't every harmless code change produce a diff?

Some will. ReGrade groups and normalizes diffs that are clearly non-semantic (e.g., timestamp fields, request IDs, ordering of unordered collections), and lets you mark specific fields as "expected to vary." The remaining diffs are real behavioral changes that deserve a human decision.
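The general idea behind this kind of normalization can be sketched as a preprocessing pass applied to both responses before they are compared. The field names in `VOLATILE`, the `unordered` parameter, and the `normalize` function are assumptions for illustration, not ReGrade's configuration or API.

```python
import json
from typing import Any

VOLATILE = {"timestamp", "request_id"}  # hypothetical "expected to vary" fields

def normalize(value: Any, unordered: frozenset[str] = frozenset(), key: str = "") -> Any:
    """Drop volatile fields and sort lists stored under keys declared as unordered."""
    if isinstance(value, dict):
        return {k: normalize(v, unordered, k) for k, v in value.items() if k not in VOLATILE}
    if isinstance(value, list):
        items = [normalize(v, unordered) for v in value]
        if key in unordered:
            # Sort by a canonical JSON encoding so mixed element types still compare.
            items = sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
        return items
    return value

a = {"request_id": "a1", "tags": ["x", "y"], "name": "Ada"}
b = {"request_id": "b2", "tags": ["y", "x"], "name": "Ada"}

unordered = frozenset({"tags"})
print(normalize(a, unordered) == normalize(b, unordered))  # True
print(normalize(a) == normalize(b))                        # False: order still counts
```

Ordering is only ignored for keys that are explicitly declared unordered, because for many endpoints a change in result order is itself a behavioral change.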

How is this different from Speedscale?

Both tools replay recorded traffic to validate code changes, and both now speak to AI-generated code. The practical difference is the operating model. Speedscale captures traffic with infrastructure installed in your cluster — a Kubernetes operator with sidecar proxies, or an eBPF node agent — with an emphasis on recording from production. ReGrade's sensor is a standalone proxy you run wherever your tests already run (CI, a laptop, staging), with no cluster install and no production access required, and the replay diff gates the PR with field-level annotations and an explicit in-PR approval path. If you want production-traffic capture wired into your Kubernetes estate, Speedscale is built around that. If you want runtime verification on every PR starting from the traffic your test suite already generates, that's ReGrade.

Does this work with gRPC, GraphQL, or only REST?

ReGrade's sensor captures at the network request/response layer. Any protocol whose responses are observable as structured field data can be diffed. REST/JSON is the most common, but gRPC and GraphQL are also supported.

What's the performance overhead of the sensor during recording?

The sensor is a recording proxy, so it adds one network hop to each request while recording. Recording is opt-in per environment: the sensor only sits in the request path when you run it, and there is no production overhead unless you choose to record production traffic. The free template lets you measure the impact in your own pipeline before committing to anything.

Can the AI agent itself get evidence from runtime verification, not just the human reviewer?

Yes. ReGrade exposes its diff output through an MCP server, which lets an AI coding agent read the same field-level evidence the human reviewer sees, and revise its refactor on the basis of that evidence. This is the loop we are most interested in: AI generates a change, runtime verification surfaces what actually changed, AI refines or escalates with grounded evidence rather than guesses.
