How We Found a 7-Year-Old Vulnerability — On the First Replay

We ran 22 standard API tests against a widely-used open-source collaboration platform. Login, user management, teams, channels, posts: ordinary CRUD operations using the platform's official SDK. No security assertions. No schema validators. No contract tests. 287 API requests total.

The only change: pointing the test harness at ReGrade's recording proxy instead of the platform directly. One environment variable. All 22 tests passed unchanged.
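As a rough sketch, that switch might look like this in a Python test harness. The variable name `PLATFORM_BASE_URL` and both URLs are hypothetical, chosen only for illustration; they are not ReGrade's or the platform's actual settings.

```python
import os

# Hypothetical default: the platform under test, reached directly.
DEFAULT_BASE_URL = "http://platform.test:8080"


def resolve_base_url(env=None):
    """Return the API base URL the tests should talk to.

    If PLATFORM_BASE_URL (a hypothetical name) is set, the tests go through
    whatever it points at, for example a recording proxy. Otherwise they
    hit the platform directly.
    """
    env = os.environ if env is None else env
    return env.get("PLATFORM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


if __name__ == "__main__":
    print(resolve_base_url())
```

Because the tests only ever see a base URL, none of them need to know whether a proxy sits in between.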

The Pipeline

Record on main. Replay on every merge request. Same version both sides.

The key insight: bcrypt generates a new random salt every time a password is hashed, so a fresh server instance stores a different hash for the same password. If a password hash is leaked in an API response, the hash values naturally differ between recording and replay, and ReGrade flags it automatically. No version change required. The entropy itself is the signal.
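The effect is easy to reproduce in a few lines. This sketch uses PBKDF2 from the Python standard library as a stand-in for bcrypt (an assumption made to keep the example dependency-free); what matters is the random salt, not the algorithm. The `hash_password` helper and the sample responses are illustrative, not the platform's code.

```python
import hashlib
import os


def hash_password(password, salt=None):
    """Hash a password with a fresh random salt unless one is supplied.

    PBKDF2 stands in for bcrypt here; both mix a random salt into the hash.
    """
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 10_000)
    return salt.hex() + "$" + digest.hex()


# Two "server instances" store the same password with independent salts.
recorded = {"username": "alice", "password": hash_password("hunter2")}
replayed = {"username": "alice", "password": hash_password("hunter2")}

# Fields whose values differ between the recording and the replay.
changed = [key for key in recorded if recorded[key] != replayed[key]]
print(changed)
```

The username is stable across instances; the leaked hash is not, which is exactly what makes it stand out in a comparison.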

ReGrade's merge request comment flagging 2 security findings — bcrypt password hashes in API response bodies.

From 3,730 Deltas to 2 Findings

Raw replay produced 3,730 deltas — overwhelming. But that's expected: dynamic IDs, timestamps, and session tokens all produce legitimate differences between server instances.

After configuring 5 ID mapping namespaces, the count dropped 83% to 624. Then 12 filter rules across 4 iterations classified 653 of 655 remaining deltas as expected noise.
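To make the two noise-reduction steps concrete, here is a minimal sketch of the idea. The delta structure, the `id_map` namespaces, and the regex-based `noise_rules` are all hypothetical; this is not ReGrade's configuration format, only one way the triage could be expressed.

```python
import re

# Each delta: a JSON path plus the recorded and replayed leaf values.
deltas = [
    {"path": "$.id", "recorded": "u_rec_1", "replayed": "u_rep_1"},
    {"path": "$.create_at", "recorded": 1700000000, "replayed": 1700000500},
    {"path": "$.password", "recorded": "hash-A", "replayed": "hash-B"},
]

# Hypothetical ID mapping for one namespace: recorded ID -> replayed ID.
id_map = {"user": {"u_rec_1": "u_rep_1"}}

# Hypothetical filter rules: path patterns whose differences are expected.
noise_rules = [re.compile(r"\.(create_at|update_at)$"), re.compile(r"\.token$")]


def is_mapped_id(delta):
    """True if some namespace maps the recorded value to the replayed one."""
    return any(
        delta["recorded"] in mapping
        and mapping[delta["recorded"]] == delta["replayed"]
        for mapping in id_map.values()
    )


def is_noise(delta):
    """True if the delta's path matches any expected-noise rule."""
    return any(rule.search(delta["path"]) for rule in noise_rules)


def triage(all_deltas):
    """Keep only deltas explained by neither an ID mapping nor a filter rule."""
    return [d for d in all_deltas if not is_mapped_id(d) and not is_noise(d)]


findings = triage(deltas)
print([d["path"] for d in findings])  # ['$.password']
```

ID mappings explain away values that are different but equivalent; filter rules explain away fields that are expected to change. Whatever neither step accounts for is left for a human to look at.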

What remained: 2 security findings. Both on the $.password field. Both showing bcrypt hashes in API response bodies.

The Reveal: CVE-2023-5968

Those 2 findings were CVE-2023-5968, Password Hash Disclosure via API. When a user updated their username, the server returned the full user object, including the bcrypt password hash, in the response body. The fix was adding a single missing Sanitize() call in one code path.

This bug had existed since 2017. It survived 7 years of unit tests, integration tests, code reviews, and security audits. Every CI pipeline passed. Every code review approved. The vulnerability was structurally invisible to any tool that validates expectations.

ReGrade found it on the first replay — without a CVE database, without a vulnerability feed, without any prior knowledge of the vulnerability. Pure behavioral comparison.

Why Tests Miss This

Traditional tests validate what you expect. You write an assertion for every behavior you anticipate. If you don't anticipate a password hash in an API response — and why would you? — no test catches it.

ReGrade works differently. It doesn't check specific fields against expected values. It compares entire responses field by field between two server instances. Any difference that isn't accounted for by the noise profile is a finding. The bcrypt hash showed up because each instance salted the same password differently, so the entropy itself was the detection mechanism.
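A field-by-field comparison of this kind can be sketched as a small recursive walk over two JSON-like responses. The `diff` function, the `noise_profile` set, and the sample payloads below are illustrative assumptions, not ReGrade's implementation.

```python
# Sentinel for a key or index present on only one side.
MISSING = object()


def diff(recorded, replayed, path="$"):
    """Yield (path, recorded_value, replayed_value) for every differing leaf."""
    if isinstance(recorded, dict) and isinstance(replayed, dict):
        for key in sorted(set(recorded) | set(replayed)):
            yield from diff(
                recorded.get(key, MISSING),
                replayed.get(key, MISSING),
                f"{path}.{key}",
            )
    elif isinstance(recorded, list) and isinstance(replayed, list):
        for i in range(max(len(recorded), len(replayed))):
            a = recorded[i] if i < len(recorded) else MISSING
            b = replayed[i] if i < len(replayed) else MISSING
            yield from diff(a, b, f"{path}[{i}]")
    elif recorded != replayed:
        yield (path, recorded, replayed)


recorded = {
    "username": "alice2",
    "update_at": 1700000000,
    "password": "hash-from-instance-A",
    "roles": ["user"],
}
replayed = {
    "username": "alice2",
    "update_at": 1700000500,
    "password": "hash-from-instance-B",
    "roles": ["user"],
}

# Hypothetical noise profile: paths whose differences are already explained.
noise_profile = {"$.update_at"}

findings = [d for d in diff(recorded, replayed) if d[0] not in noise_profile]
print(findings)
```

Nothing in this walk knows what a password is. The `$.password` path surfaces only because its value differs and no rule explains the difference.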

The Takeaway

If your tests could find vulnerabilities they weren't looking for, what would they find?

ReGrade catches what tests structurally cannot — not because your tests are bad, but because testing expectations and detecting unknowns are fundamentally different capabilities. One validates what you know. The other surfaces what you don't.