Skyler Lister-Aley

Claude Mythos Is Now Hunting Your Memory-Safety Bugs. Migrate to a Memory-Safe Language — and Prove the Rewrite Behaves.

Anthropic's Claude Mythos just found 16-to-27-year-old vulnerabilities in OpenBSD and FFmpeg — and almost all are memory-safety bugs. As Mythos-class models reach threat actors, the defensible response is to migrate C/C++ to a memory-safe language at AI speed and prove the rewrite behaves identically. Here's how the verification works.

Tags: memory-safety, rust, refactoring, ai-code, runtime-verification, regrade

In April 2026, Anthropic's Claude Mythos read OpenBSD's networking code and found a crash bug that had been hiding for 27 years. The same model surfaced a flaw in FFmpeg's H.264 decoder that had survived 16 years and more than five million automated tests.

Neither bug was exotic. The OpenBSD flaw is a null-pointer dereference triggered by a signed-integer overflow in TCP sequence-number math. The FFmpeg flaw is an out-of-bounds write: craft a frame with exactly 65,536 slices, collide a slice index with a sentinel value, and the decoder writes past its buffer. Old, mechanical, devastating — and now findable on demand.
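To make the two bug shapes concrete, here is a toy sketch in Python (chosen only for brevity). It is not the OpenBSD or FFmpeg code: the 16-bit table, the `SENTINEL` value, and the helper names are assumptions for illustration. The `seq_lt` helper mirrors the widely used C idiom `(int32_t)(a - b) < 0` for comparing wrapping sequence numbers.

```python
SENTINEL = 0xFFFF  # hypothetical "no slice assigned" marker in a 16-bit table


def to_u16(value: int) -> int:
    """Truncate to 16 bits, as storing into a uint16_t would in C."""
    return value & 0xFFFF


def seq_lt(a: int, b: int) -> bool:
    """Wraparound-style 'a is before b' check on 32-bit sequence numbers."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:  # reinterpret the 32-bit pattern as a signed value
        diff -= 0x100000000
    return diff < 0


# Slice indices 0..65535 cover 65,536 slices; the last index equals the sentinel.
last_slice_index = to_u16(65_536 - 1)
collides = last_slice_index == SENTINEL

# Two sequence numbers exactly 2**31 apart: each one looks "before" the other.
a, b = 0, 0x80000000
both_before = seq_lt(a, b) and seq_lt(b, a)
```

In both cases, a value at the very edge of an integer range takes on a second meaning the surrounding code never planned for. That kind of condition is easy to miss with random inputs and easy to reach with a deliberately crafted one.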

The Bugs Were Always There. Now AI Finds Them Cheaply.

Nothing about that OpenBSD code changed in 27 years. What changed is the cost of finding the bug in it.

Mythos-class models read code from angles fuzzers never take and reason through the multi-step conditions that make a flaw exploitable — at a price that collapses the economics of vulnerability research. Anthropic reported Claude Mythos surfacing serious findings across major open-source systems in single research engagements for a few thousand dollars each.

This is not one lab's party trick. The UK AI Security Institute found that OpenAI's GPT-5.5 reaches a similar overall cyber-offense level to Claude Mythos — GPT-5.5 was only the second model to fully solve a complex multi-stage attack simulation. Mythos still leads on dedicated vulnerability discovery, but the trend matters more than the ranking: the firepower aimed at your old C and C++ is multiplying across vendors, and it is getting cheaper every quarter.

Mythos-class models don't stay inside vendor red teams. Assume the same capability is already in threat actors' hands — and that your legacy code is being read.

Almost All of It Is Memory Safety

Here is the pattern in the recent finds, and in the decades before them: it is overwhelmingly memory safety.

Three of the largest software producers in the world independently converged on roughly the same number. Microsoft reported that ~70% of the CVEs it assigns each year are memory-safety issues. Chromium found ~70% of its high-severity bugs are memory unsafety. Mozilla found 94% of critical and high bugs in one component were memory-related, and 74% would have been impossible in a memory-safe language.

The OpenBSD null-pointer dereference and the FFmpeg out-of-bounds write aren't outliers. They're the 70%.

The Fix Is a Language Change — and the Government Is Pushing It

You cannot test your way out of an entire bug class. You can remove it.

A memory-safe language eliminates the category of memory-corruption bugs by construction — at the compiler or runtime level, not "reduced with discipline." CISA's guidance names Rust, Go, Swift, C#, Java, Python, and JavaScript among the safe options. The right target depends on your domain — Rust or Go for systems code and services, Swift on Apple platforms, C# or Java for managed back ends — but the property that matters is shared: the memory-corruption bug class is gone, save for the narrow escape hatches (like Rust's unsafe) you can audit deliberately.
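As a small illustration of "by construction", the sketch below shows how a memory-safe runtime handles the same kind of out-of-bounds write. Python stands in for any of the languages listed, and `checked_write` is a hypothetical helper, not part of any library.

```python
def checked_write(buffer: bytearray, index: int, value: int) -> bool:
    """Write one byte; report failure instead of touching memory outside the buffer."""
    try:
        buffer[index] = value
    except IndexError:
        # The runtime refused the write. In C, the equivalent store would be
        # undefined behavior and could silently corrupt adjacent memory.
        return False
    return True


frame = bytearray(4)
in_bounds = checked_write(frame, 3, 0xFF)      # last valid index
out_of_bounds = checked_write(frame, 4, 0xFF)  # one past the end
```

The out-of-bounds attempt becomes a well-defined error the program can handle or crash on cleanly. Note that this removes memory corruption, not logic errors: Python's negative indices, for example, count from the end of the buffer rather than raising.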

The U.S. government has made this the expected direction of responsible engineering. NSA and CISA jointly published Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development, and CISA's Secure by Design guidance asks software makers to publish a memory-safety roadmap by January 1, 2026. It is framed as guidance, not statute — but for anyone selling to the federal government, voluntary CISA guidance has a way of becoming mandatory through procurement. The direction is set: migrate the memory-unsafe code, starting with anything network-facing.

The results, where teams have done it, are not subtle. Google's Android team watched memory-safety vulnerabilities fall from 76% of the total in 2019 to under 20% in 2025 as new development moved to memory-safe languages, with annual memory-safety bugs dropping from 223 to fewer than 50.

AI Can Do the Rewrite. That's Exactly the Problem.

For decades, the reason organizations didn't migrate off C/C++ was simple: rewriting a large, load-bearing codebase by hand is prohibitively slow and risky. AI changes the first half of that sentence. It does nothing for the second.

AI agents can transpile C/C++ to Rust, modernize APIs, and rewrite services at a pace that used to require armies of engineers. But refactoring is dangerous even when humans do it carefully. Microsoft Research surveyed 328 engineers and found 76% said refactoring risks introducing bugs and regressions. A UCLA study found only 22% of refactored code is covered by existing regression tests, and when teams relied on tests alone, they caught just 13% of seeded refactoring faults. That's an ~87% miss rate, before you let an AI rewrite thousands of lines per session, each line a confident best guess.

So you've fixed the bottleneck and amplified the risk. An AI rewrite that compiles and passes your tests can still return a subtly different response on an edge case nobody asserted — and now you've shipped a behavioral regression into the service you migrated for safety reasons.
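A minimal sketch of how that happens, using a difference that is well documented: integer division in C truncates toward zero, while Python's `//` floors toward negative infinity. The function names and the "existing test" are hypothetical.

```python
def c_style_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C's `/` does."""
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


def legacy_average(total: int, count: int) -> int:
    """Stand-in for the original C behavior."""
    return c_style_div(total, count)


def rewritten_average(total: int, count: int) -> int:
    """Stand-in for a rewrite that looks equivalent at a glance."""
    return total // count  # floors toward negative infinity


# The existing test only asserts a positive case, so both versions pass it.
test_passes = legacy_average(7, 2) == 3 and rewritten_average(7, 2) == 3

# A negative total, which nobody asserted, exposes the behavioral change.
edge_case = (legacy_average(-7, 2), rewritten_average(-7, 2))
```

Both versions satisfy the test, yet they return -3 and -4 for the same input. Nothing in the build or the test run would flag the difference.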

The question is no longer "can we rewrite it?" It's "how do we prove the rewrite behaves identically?"

Refactor at AI Speed, Verify at Field Level

You don't prove behavioral equivalence by reading code. You prove it by comparing behavior.

This is the use case ReGrade was built for:

  • Baseline. Record the complete API behavior of your current service — every endpoint, response, header, and field. This is your source of truth: the behavioral contract the new version must match.
  • Refactor. Point your AI agent at the codebase and let it migrate C/C++ to a memory-safe language, modernize, rewrite. The AI handles generation.
  • Verify. Replay the recorded traffic against the refactored service and diff every response field by field. The language changed, the compiler changed, the memory model changed — ReGrade checks whether the observable behavior stayed the same.
  • Iterate. Feed the structured diffs back to the AI agent through ReGrade's MCP server. The agent sees exactly which field changed and self-corrects — grounded in real behavioral evidence, not guesses. Repeat until the delta is zero or every remaining difference is an improvement you've explicitly approved.
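The shape of that loop can be sketched in a few lines. This is a generic record-and-replay illustration under simplifying assumptions, not ReGrade's API: the two services are in-process stand-ins, every name is hypothetical, and responses are compared whole rather than field by field to keep the example short.

```python
from typing import Callable

Request = dict
Response = dict
Handler = Callable[[Request], Response]


def legacy_service(request: Request) -> Response:
    """Stand-in for the existing C/C++ service."""
    return {"status": 200, "body": {"id": request["id"], "total": request["qty"] * 5}}


def rewritten_service(request: Request) -> Response:
    """Stand-in for the migrated service, with a deliberate edge-case change."""
    total = request["qty"] * 5 if request["qty"] > 0 else None
    return {"status": 200, "body": {"id": request["id"], "total": total}}


def record_baseline(handler: Handler, requests: list) -> list:
    """Baseline: capture each request together with the response it produced."""
    return [(req, handler(req)) for req in requests]


def replay_and_diff(handler: Handler, baseline: list) -> list:
    """Verify: replay recorded requests and collect every response that changed."""
    mismatches = []
    for req, expected in baseline:
        actual = handler(req)
        if actual != expected:
            mismatches.append({"request": req, "expected": expected, "actual": actual})
    return mismatches


traffic = [{"id": 1, "qty": 3}, {"id": 2, "qty": 0}]
baseline = record_baseline(legacy_service, traffic)
mismatches = replay_and_diff(rewritten_service, baseline)
```

Here only the zero-quantity request differs (`total` is `0` in the baseline and `None` in the rewrite). The list of mismatches is the structured evidence an agent would be handed in the iterate step.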

What you get isn't a test suite's opinion or a reviewer's best effort. It's deterministic, field-level evidence that the rewritten, memory-safe version behaves identically to the C/C++ version it replaces — across every interaction you recorded, not a sample.

The Migrations That Worked Had Proof

Whatever language you target, the reference migrations that succeeded share one trait: the teams could verify the new implementation preserved the old one's behavioral contract. (The famous systems rewrites happen to be Rust — the verification principle is identical whether you land on Go, Swift, or a managed runtime.)

Cloudflare's Pingora, a Rust proxy replacing NGINX, now serves over a trillion requests a day with 70% less CPU and 67% less memory — and zero crashes from service code since inception. Discord's Go-to-Rust rewrite of a core service erased its latency spikes entirely. These weren't leaps of faith. They were migrations where someone could confirm the new code did what the old code did.

At the scale of a modern service, faith doesn't scale. Verification does.

Preparing for Mythos-Class Models in the Hands of Threat Actors

The economics just inverted. Finding the memory-safety bug in your legacy code used to be expensive and rare; now it's cheap and increasingly automated — for you and for the threat actors pointing Mythos-class models at your code. The defensible move is to migrate the unsafe code off the bug class entirely, fast, and prove each rewrite preserved behavior before it ships.

AI gives you the speed. ReGrade gives you the proof.

See how it works — start free at curtail.com. Free plan, no credit card, no expiration.


Sources

  • Anthropic, "Claude Mythos Preview" cybersecurity assessment — OpenBSD 27-year null-pointer dereference; FFmpeg 16-year out-of-bounds write. anthropic.com
  • UK AI Security Institute, via The Decoder — GPT-5.5 reaches similar overall cyber-offense capability to Claude Mythos; Mythos leads on vulnerability discovery. the-decoder.com
  • Microsoft MSRC — ~70% of CVEs are memory-safety issues. msrc.microsoft.com
  • Chromium Project — ~70% of high-severity bugs are memory unsafety. chromium.org
  • Mozilla Hacks — 94% of critical/high CSS-component bugs memory-related; 74% impossible in Rust. hacks.mozilla.org
  • NSA/CISA, "Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development" (June 2025). media.defense.gov
  • CISA, Secure by Design Pledge — memory-safety roadmap guidance. cisa.gov
  • The New Stack — federal guidance pushing critical software off C/C++ with a 2026 roadmap expectation. thenewstack.io
  • Google Security Blog — Android memory-safety vulnerabilities 76% (2019) → under 20% (2025); 223 → fewer than 50 annual bugs. security.googleblog.com
  • Microsoft Research (Kim, Zimmermann, Nagappan) — 76% of engineers say refactoring risks regressions. microsoft.com
  • UCLA (Kim & Prete) — 22% of refactored code covered by regression tests; 13% fault detection with testing alone. web.cs.ucla.edu
  • Cloudflare, Pingora — 1T+ requests/day, 70% less CPU, 67% less memory, zero service-code crashes. blog.cloudflare.com
  • Discord Engineering — Go→Rust rewrite eliminated latency spikes. discord.com/blog

Frequently asked questions

Why is Claude Mythos suddenly finding so many old vulnerabilities?

The bugs were always there. What changed is the cost of finding them. Mythos-class models can now read code from angles fuzzers never took and reason about complex multi-step conditions, so flaws that survived decades and millions of automated tests are surfacing in single research runs for a few thousand dollars. This is a capability across multiple vendors, not one lab — the UK AI Security Institute found OpenAI's GPT-5.5 reaches a similar overall cyber-offense level to Anthropic's Claude Mythos, with Mythos still ahead on dedicated vulnerability-finding.

How do I prepare for Mythos-class models in the hands of threat actors?

Assume attackers can now point the same vulnerability-finding models at your code that researchers use. Two durable moves: (1) remove the bug class these models are best at finding — migrate memory-unsafe C/C++ to a memory-safe language, starting with network-facing code — and (2) make every migration verifiable, so you can remediate at AI speed without shipping behavioral regressions. Speed of remediation is the new advantage; field-level verification is what makes that speed safe.

Are most of these vulnerabilities really memory-safety issues?

Yes. Microsoft, the Chromium project, and Mozilla independently found that roughly 70% of their severe security vulnerabilities are memory-safety bugs — out-of-bounds reads and writes, use-after-free, and null-pointer dereferences. The recent AI finds fit the pattern: the OpenBSD bug is a null-pointer dereference and the FFmpeg bug is an out-of-bounds write.

Does migrating C/C++ to a memory-safe language actually eliminate these bugs?

It eliminates the class. A memory-safe language — Rust, Go, Swift, C#, Java, and others — removes the entire category of memory-corruption bugs that accounts for roughly 70% of severe CVEs. It does not remove logic errors, and escape hatches (like Rust's `unsafe` blocks) still need scrutiny. Google's Android team saw memory-safety vulnerabilities fall from 76% of the total in 2019 to under 20% in 2025 as new code shifted to memory-safe languages.

Is migrating to a memory-safe language required by the government?

Not as a statute, but close in practice. NSA and CISA jointly published guidance urging migration off memory-unsafe languages, and CISA's Secure by Design guidance asks software makers to publish a memory-safety roadmap by January 1, 2026. For federal contractors, that "voluntary" guidance tends to become mandatory through procurement requirements.

If AI rewrites my code, how do I know it still behaves the same?

You compare behavior, not code. Record the real API traffic your current service handles, replay that exact traffic against the refactored version, and diff every response field. If the new version returns the same status codes, bodies, headers, and fields (minus expected dynamic values), you have deterministic evidence the rewrite preserved behavior. If something changed, you see exactly which field, where, and how.
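A field-level comparison that skips expected dynamic values might look like the following sketch. The dotted path notation, the `MISSING` marker, and the function names are assumptions made for illustration; they do not describe ReGrade's actual output format.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a field present on only one side


def flatten(value: Any, prefix: str = "") -> dict:
    """Turn nested dicts and lists into {"path.to.field": leaf_value}."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        return {prefix: value}  # leaf (or empty container, kept as a leaf)
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def field_diff(expected: dict, actual: dict, ignore=frozenset()) -> dict:
    """Return {path: (before, after)} for every field that changed."""
    old, new = flatten(expected), flatten(actual)
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        if path in ignore:  # expected dynamic value, e.g. a request ID
            continue
        before, after = old.get(path, MISSING), new.get(path, MISSING)
        if before != after:
            changes[path] = (before, after)
    return changes


baseline = {
    "status": 200,
    "headers": {"content-type": "application/json", "x-request-id": "a1"},
    "body": {"user": {"id": 7, "name": "Ada"}, "items": [1, 2]},
}
candidate = {
    "status": 200,
    "headers": {"content-type": "application/json", "x-request-id": "b2"},
    "body": {"user": {"id": 7, "name": "ada"}, "items": [1, 2]},
}
changes = field_diff(baseline, candidate, ignore={"headers.x-request-id"})
```

The request ID differs on every call and is ignored, so the only reported change is `body.user.name`, from `"Ada"` to `"ada"`. The ignore list is kept explicit so that each excluded field is a deliberate, reviewable decision.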

How is this different from just running my test suite?

Tests check what someone thought to assert. A UCLA study found only 22% of refactored code is covered by existing regression tests, and testing alone caught just 13% of seeded refactoring faults. Field-level behavioral comparison does not depend on anyone having predicted the failure — it compares the whole observable response between the version you trust and the version you are evaluating.
