Why is Claude Mythos suddenly finding so many old vulnerabilities?

The bugs were always there, and fuzzing, which is cheap and ran on code like FFmpeg's H.264 decoder for years, never found them. What changed isn't cost; it's that a model can now reason about code the way a vulnerability researcher does, following the multi-step conditions that make a flaw exploitable, at a scale no research team could staff. That makes it economical to deeply audit decades-old code everyone assumed was safe. It is a shift in capability across multiple vendors, not in access: the UK AI Security Institute found OpenAI's GPT-5.5 at a similar overall cyber-offense level to Claude Mythos.

How do I prepare for Mythos-class models in the hands of threat actors?

Assume attackers can now point the same vulnerability-finding models at your code that researchers use. Two durable moves: (1) remove the bug class these models are best at finding — migrate memory-unsafe C/C++ to a memory-safe language, starting with network-facing code — and (2) make every migration verifiable, so you can remediate at AI speed without shipping behavioral regressions. Speed of remediation is the new advantage; field-level verification is what makes that speed safe.

Are most of these vulnerabilities really memory-safety issues?

Yes. Microsoft and the Chromium project each found that roughly 70% of their severe security vulnerabilities are memory-safety bugs (out-of-bounds reads and writes, and use-after-free), and Mozilla found that 74% of the security bugs it examined in one component would have been impossible in a memory-safe language. The recent FFmpeg find fits the pattern exactly: an out-of-bounds write, the spatial-corruption class a memory-safe language removes by construction.

Does migrating C/C++ to a memory-safe language actually eliminate these bugs?

It eliminates the class. A memory-safe language — Rust, Go, Swift, C#, Java, and others — removes the entire category of memory-corruption bugs that accounts for roughly 70% of severe CVEs. It does not remove logic errors, and escape hatches (like Rust's `unsafe` blocks) still need scrutiny. Google's Android team saw memory-safety vulnerabilities fall from 76% of the total in 2019 to under 20% in 2025 as new code shifted to memory-safe languages.

If AI rewrites my code, how do I know it still behaves the same?

You compare behavior, not code. Record the real API traffic your current service handles, replay that exact traffic against the refactored version, and diff every response field. If the new version returns the same status codes, bodies, headers, and fields (minus expected dynamic values), you have deterministic evidence the rewrite preserved behavior. If something changed, you see exactly which field, where, and how.
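
As a rough illustration of what "diff every response field" can mean, here is a minimal Python sketch that walks two JSON-like responses and reports each field that differs. It is not ReGrade's implementation or API; the function name `diff_fields`, the `DYNAMIC_FIELDS` set, and the sample responses are all hypothetical.

```python
from typing import Any

# Hypothetical names of fields expected to differ on every call.
DYNAMIC_FIELDS = {"timestamp", "request_id"}


def diff_fields(old: Any, new: Any, path: str = "") -> list[str]:
    """Return one line per field that differs between two responses."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(old.keys() | new.keys()):
            if key in DYNAMIC_FIELDS:
                continue  # expected to change; not a regression
            child = f"{path}.{key}" if path else key
            if key not in new:
                diffs.append(f"{child}: missing in new version")
            elif key not in old:
                diffs.append(f"{child}: added in new version")
            else:
                diffs.extend(diff_fields(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_fields(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars, mismatched types, or lists of different length.
    return [] if old == new else [f"{path}: {old!r} -> {new!r}"]


# Made-up responses: the rewrite returns a number where the original
# returned a string, which a status-code assertion would not notice.
baseline = {"status": 200, "body": {"id": 7, "price": "10.50", "timestamp": 1}}
candidate = {"status": 200, "body": {"id": 7, "price": 10.5, "timestamp": 2}}
print(diff_fields(baseline, candidate))
```

In this made-up case the only reported difference is `body.price`, because `timestamp` is listed as dynamic. Note that plain `==` treats `1` and `1.0` as equal, so a stricter comparison would also check types.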

How is this different from just running my test suite?

Tests check what someone thought to assert. A UCLA study found only 22% of refactored code is covered by existing regression tests, and testing alone caught just 13% of seeded refactoring faults. Field-level behavioral comparison does not depend on anyone having predicted the failure — it compares the whole observable response between the version you trust and the version you are evaluating.

Claude Mythos Is Now Hunting Your Memory-Safety Bugs. Migrate to a Memory-Safe Language — and Prove the Rewrite Behaves.

Is migrating to a memory-safe language required by the government?

Not as a statute, but close in practice. NSA and CISA jointly published guidance urging migration off memory-unsafe languages, and CISA's Secure by Design guidance asks software makers to publish a memory-safety roadmap by January 1, 2026. For federal contractors, that "voluntary" guidance tends to become mandatory through procurement requirements.

In April 2026, Anthropic's Claude Mythos read FFmpeg's H.264 decoder and found an out-of-bounds write that had survived 16 years and more than five million automated tests: craft a frame with exactly 65,536 slices, collide a slice index with a sentinel value, and the decoder writes past its buffer. That is memory corruption — the bug class behind most severe CVEs, and the exact class a memory-safe language eliminates by construction.

The bug class is what matters. Out-of-bounds writes and use-after-frees, the corruption that turns into remote code execution, are what a memory-safe language takes off the table. (It won't catch every bug; logic errors survive, and a safe program can still crash. But this class, the one behind most severe CVEs, is gone.) And FFmpeg isn't obscure: it decodes video in nearly every device with a screen. The bug was old, mechanical, and is now findable on demand.

The Bugs Were Always There. What Changed Is Who's Looking.

Nothing about that FFmpeg code changed in 16 years. Fuzzers had been hammering it the whole time, essentially for free — and never found this bug. It takes a frame with exactly 65,536 slices to trigger, a needle no random fuzzer stumbles onto.
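
To see why the trigger sits at one exact count, consider a toy model in Python. This is not FFmpeg's code: the 16-bit width of the slice table and the `0xFFFF` sentinel are assumptions made for illustration, and the function names are hypothetical.

```python
SENTINEL = 0xFFFF  # assumed 16-bit marker meaning "no slice here"


def stored_slice_ids(slice_count: int) -> range:
    """Slice IDs 0..slice_count-1 as an assumed 16-bit table would hold them."""
    if not 0 < slice_count <= 0x10000:
        raise ValueError("toy model covers 1..65536 slices")
    return range(slice_count)


def has_sentinel_collision(slice_count: int) -> bool:
    """True when a real slice ID is indistinguishable from the sentinel."""
    return SENTINEL in stored_slice_ids(slice_count)


print(has_sentinel_collision(65_535))  # highest real ID is 65534
print(has_sentinel_collision(65_536))  # highest real ID is 65535 == SENTINEL
```

Under these assumptions, every frame with fewer than 65,536 slices keeps real IDs and the sentinel distinct, so the code looks correct on any input short of that one count. Reaching it takes reasoning about the invariant, not volume of random inputs.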

So what changed isn't the cost of looking; fuzzing was always cheap. What changed is that a machine can now reason about code the way a vulnerability researcher does (follow a multi-step condition, work out why one specific input breaks an invariant) and do it at a scale no team of researchers could ever staff. Anthropic's own red team reported Mythos writing, in hours, exploits that expert penetration testers said would have taken them weeks. But the skill level is almost beside the point: nobody was ever going to pay even a top researcher to deeply audit decades-old code that everyone assumed was safe. Now that audit is a script you can run.

And don't get distracted by who has the models. Threat actors have had frontier models since this race began (public APIs, open weights, all of it), and although Mythos leads on vulnerability discovery, it isn't dramatically more capable overall than other available models; the UK AI Security Institute found OpenAI's GPT-5.5 at a similar overall cyber-offense level. What crossed a threshold isn't access. It's that the models got good enough to turn "presumed safe" legacy code into a list of real findings, and that capability is still climbing. Assume the same scrutiny is now aimed at your own old C and C++.

Almost All of It Is Memory Safety

Here is the pattern — in this find, and in the decades of severe CVEs before it: it is overwhelmingly memory safety.

Three of the largest software producers in the world independently reached the same conclusion, with figures clustered around 70%. Microsoft reported that ~70% of the CVEs it assigns each year are memory-safety issues. Chromium found ~70% of its high-severity bugs are memory-unsafety problems. Mozilla found 94% of critical and high bugs in one component were memory-related, and 74% would have been impossible in a memory-safe language.

The FFmpeg out-of-bounds write isn't an outlier. It's the 70% — the memory-corruption class those numbers describe.

The Fix Is a Language Change — and the Government Is Pushing It

You cannot test your way out of an entire bug class. You can remove it.

A memory-safe language eliminates the category of memory-corruption bugs by construction: the guarantee is enforced by the compiler or runtime, not left to programmer discipline. CISA's guidance names Rust, Go, Swift, C#, Java, Python, and JavaScript among the safe options. The right target depends on your domain (Rust or Go for systems code and services, Swift on Apple platforms, C# or Java for managed back ends), but the property that matters is shared: the memory-corruption bug class is gone, save for the narrow escape hatches (like Rust's `unsafe`) you can audit deliberately.
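
A small sketch of what "enforced by the runtime" looks like in practice, using Python (one of the languages on CISA's list). The helper name `store_sample` and the 4-byte buffer are hypothetical; the point is only that a write one past the end is refused instead of landing in adjacent memory, as it could in C.

```python
def store_sample(buffer: bytearray, index: int, value: int) -> bool:
    """Write one byte; report False instead of writing outside the buffer."""
    if index < 0:
        return False  # Python treats negative indexes as "from the end"
    try:
        buffer[index] = value
    except IndexError:
        # The runtime checked the bound; nothing outside the buffer changed.
        return False
    return True


frame = bytearray(4)
print(store_sample(frame, 3, 0xAB))  # last valid position
print(store_sample(frame, 4, 0xAB))  # one past the end
print(frame.hex())
```

The second call fails cleanly and the buffer keeps its length. That is still a bug to fix (the program may stop or mishandle the input), but it is no longer a memory-corruption primitive.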

The U.S. government has made this the expected direction of responsible engineering. NSA and CISA jointly published "Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development," and CISA's Secure by Design guidance asks software makers to publish a memory-safety roadmap by January 1, 2026. It is framed as guidance, not statute, but for anyone selling to the federal government, voluntary CISA guidance has a way of becoming mandatory through procurement. The direction is set: migrate the memory-unsafe code, starting with anything network-facing.

The results, where teams have done it, are not subtle. Google's Android team watched memory-safety vulnerabilities fall from 76% of the total in 2019 to under 20% in 2025 as new development moved to memory-safe languages, with annual memory-safety bugs dropping from 223 to fewer than 50.

AI Can Do the Rewrite. That's Exactly the Problem.

For decades, the reason organizations didn't migrate off C/C++ was simple: rewriting a large, load-bearing codebase by hand was too slow to attempt and too risky to trust. AI makes it fast. It does nothing about the risk.

AI agents can transpile C/C++ to Rust, modernize APIs, and rewrite services at a pace that used to require armies of engineers. But refactoring is dangerous even when humans do it carefully. Microsoft Research surveyed 328 engineers and found 76% said refactoring risks introducing bugs and regressions. A UCLA study found only 22% of refactored code is covered by existing regression tests, and when teams relied on tests alone, they caught just 13% of seeded refactoring faults. That's a roughly 87% miss rate, before you let an AI rewrite thousands of lines per session, each line a confident best guess.

So you've fixed the bottleneck and amplified the risk. An AI rewrite that compiles and passes your tests can still return a subtly different response on an edge case nobody asserted — and now you've shipped a behavioral regression into the service you migrated for safety reasons.

The question is no longer "can we rewrite it?" It's "how do we prove the rewrite behaves identically?"

Refactor at AI Speed, Verify at Field Level

You don't prove behavioral equivalence by reading code. You prove it by comparing behavior.

This is the use case ReGrade was built for:

Baseline. Record the complete API behavior of your current service — every endpoint, response, header, and field. This is your source of truth: the behavioral contract the new version must match.
Refactor. Point your AI agent at the codebase and let it migrate C/C++ to a memory-safe language, modernize, rewrite. The AI handles generation.
Verify. Replay the recorded traffic against the refactored service and diff every response field by field. The language changed, the compiler changed, the memory model changed — ReGrade checks whether the observable behavior stayed the same.
Iterate. Feed the structured diffs back to the AI agent through ReGrade's MCP server. The agent sees exactly which field changed and self-corrects — grounded in real behavioral evidence, not guesses. Repeat until the delta is zero or every remaining difference is an improvement you've explicitly approved.
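
The verify and iterate steps above can be sketched as a small replay loop. This is a simplified stand-in, not ReGrade's API or its MCP interface: the `replay` function, the record layout, and the `rewritten_service` handler are hypothetical, and a real setup would send the recorded requests over the network to the refactored service.

```python
from typing import Callable

Response = dict  # assumed shape: {"status": int, "headers": dict, "body": object}


def replay(recorded: list[dict], candidate: Callable[[dict], Response]) -> list[dict]:
    """Replay each recorded request and list every top-level field that changed."""
    findings = []
    for item in recorded:
        actual = candidate(item["request"])
        expected = item["response"]
        for field in ("status", "headers", "body"):
            if expected.get(field) != actual.get(field):
                findings.append({
                    "request": item["request"],
                    "field": field,
                    "expected": expected.get(field),
                    "actual": actual.get(field),
                })
    return findings


def rewritten_service(request: dict) -> Response:
    """Stand-in for the refactored service (hypothetical handler)."""
    if request["path"] == "/health":
        return {"status": 200, "headers": {}, "body": "ok"}
    return {"status": 404, "headers": {}, "body": "not found"}


# Made-up baseline: what the original service returned for two requests.
recorded_traffic = [
    {"request": {"method": "GET", "path": "/health"},
     "response": {"status": 200, "headers": {}, "body": "ok"}},
    {"request": {"method": "GET", "path": "/missing"},
     "response": {"status": 404, "headers": {}, "body": "Not Found"}},
]

findings = replay(recorded_traffic, rewritten_service)
for finding in findings:
    print(finding["request"]["path"], finding["field"],
          repr(finding["expected"]), "->", repr(finding["actual"]))
```

Here the status codes match and only the 404 body's capitalization changed, which is the kind of difference a status-only test would pass. A structured list like `findings` is what gets handed back to the agent; an empty list is the "delta is zero" stopping condition.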

What you get isn't a test suite's opinion or a reviewer's best effort. It's deterministic, field-level evidence that the rewritten, memory-safe version behaves identically to the C/C++ version it replaces — across every interaction you recorded, not a sample.

The Migrations That Worked Had Proof

Whatever language you target, the reference migrations that succeeded share one trait: the teams could verify the new implementation preserved the old one's behavioral contract. (The famous systems rewrites happen to be Rust — the verification principle is identical whether you land on Go, Swift, or a managed runtime.)

Cloudflare's Pingora, a Rust proxy replacing NGINX, now serves over a trillion requests a day with 70% less CPU and 67% less memory — and zero crashes from service code since inception. Discord's Go-to-Rust rewrite of a core service erased its latency spikes entirely. These weren't leaps of faith. They were migrations where someone could confirm the new code did what the old code did.

At the scale of a modern service, faith doesn't scale. Verification does.

Preparing for Mythos-Class Models in the Hands of Threat Actors

The scrutiny just changed. Finding the memory-safety bug in your legacy code used to require a scarce, expensive vulnerability researcher, so nobody ever pointed one at code presumed safe. Now that depth of analysis is automated and scalable, and anyone with a frontier model can aim it at your code, threat actors included. The defensible move is to migrate the unsafe code to a memory-safe language, fast, so the bug class is gone, and to prove each rewrite preserved behavior before it ships.

AI gives you the speed. ReGrade gives you the proof.

See how it works — start free at curtail.com. Free plan, no credit card, no expiration.

Sources

Anthropic, "Claude Mythos Preview" cybersecurity assessment — FFmpeg 16-year H.264 out-of-bounds write. anthropic.com
UK AI Security Institute, via The Decoder — GPT-5.5 reaches similar overall cyber-offense capability to Claude Mythos; Mythos leads on vulnerability discovery. the-decoder.com
Microsoft MSRC — ~70% of CVEs are memory-safety issues. msrc.microsoft.com
Chromium Project — ~70% of high-severity bugs are memory unsafety. chromium.org
Mozilla Hacks — 94% of critical/high CSS-component bugs memory-related; 74% impossible in Rust. hacks.mozilla.org
NSA/CISA, "Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development" (June 2025). media.defense.gov
CISA, Secure by Design Pledge — memory-safety roadmap guidance. cisa.gov
The New Stack — federal guidance pushing critical software off C/C++ with a 2026 roadmap expectation. thenewstack.io
Google Security Blog — Android memory-safety vulnerabilities 76% (2019) → under 20% (2025); 223 → fewer than 50 annual bugs. security.googleblog.com
Microsoft Research (Kim, Zimmermann, Nagappan) — 76% of engineers say refactoring risks regressions. microsoft.com
UCLA (Kim & Prete) — 22% of refactored code covered by regression tests; 13% fault detection with testing alone. web.cs.ucla.edu
Cloudflare, Pingora — 1T+ requests/day, 70% less CPU, 67% less memory, zero service-code crashes. blog.cloudflare.com
Discord Engineering — Go→Rust rewrite eliminated latency spikes. discord.com/blog