
Claude Mythos Preview: Benchmarks, Exploit Chains, and the Technical Deep Dive
TL;DR: "Mythos Preview more than doubles Opus 4.6 on SWE-bench Multimodal, solves all 35 Cybench CTF challenges on the first attempt, and autonomously developed a 6-stage ROP chain exploit for remote root on FreeBSD. Cost for the OpenBSD scan: under $20,000."
– Till Freitag
The Benchmark Data in Detail
On April 7, 2026, Anthropic published the complete benchmark data for Claude Mythos Preview. The numbers confirm what the March leak hinted at – and exceed it.
Coding Benchmarks
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | — |
| SWE-bench Pro | 77.8% | 53.4% | 57.7% |
| Terminal-Bench 2.0 | 82.0% | 65.4% | — |
| SWE-bench Multimodal | 59.0% | 27.1% | — |
SWE-bench Verified: 13.1 percentage point lead. SWE-bench Pro: 24.4 points. Most dramatic is SWE-bench Multimodal – Mythos more than doubles Opus 4.6's score.
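The gaps quoted here can be re-derived from the table above. The following is a minimal sketch that recomputes the percentage-point lead and the score ratio for each coding benchmark; the helper name `compare` is purely illustrative.

```python
# Scores copied from the coding benchmark table (percent): (Mythos Preview, Opus 4.6).
scores = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
    "Terminal-Bench 2.0": (82.0, 65.4),
    "SWE-bench Multimodal": (59.0, 27.1),
}


def compare(mythos: float, opus: float) -> tuple[float, float]:
    """Return (percentage-point gap, score ratio), rounded for display."""
    return round(mythos - opus, 1), round(mythos / opus, 2)


for name, (mythos, opus) in scores.items():
    gap, ratio = compare(mythos, opus)
    print(f"{name}: +{gap} points, {ratio}x")
```

Only SWE-bench Multimodal has a ratio above 2; the other three leads are large in absolute points but well short of a doubling.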
Reasoning and Mathematics
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| GPQA Diamond | 94.6% | — | — |
| HLE (with tools) | 64.7% | 53.1% | — |
| USAMO 2026 | 97.6% | 42.3% | — |
The USAMO results are the most striking gap in the entire dataset: 97.6% vs. 42.3% – a 55.3 percentage point difference on a competitive mathematics exam.
Agentic Tasks and Long Context
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| OSWorld | 79.6% | — | — |
| BrowseComp | 86.9% | — | — |
| GraphWalks (256K–1M tokens) | 80.0% | 38.7% | 21.4% |
GraphWalks tests reasoning over extremely long contexts from 256K to 1 million tokens. Mythos scores 80.0% – roughly 3.7x GPT-5.4's 21.4% and about twice Opus 4.6's 38.7%.
Cybersecurity
| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| CyberGym | 83.1% | 66.6% |
| Cybench (pass@1, 10 attempts) | 100% | — |
Cybench comprises 35 CTF challenges. Mythos Preview solved every single one on the first attempt across 10 runs.
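For readers unfamiliar with the pass@1 notation, here is a sketch of the widely used unbiased pass@k estimator, which turns n sampled attempts with c successes into a probability of success within k tries. Whether the Cybench evaluation computed its figure exactly this way is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of them successful."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A challenge solved in all 10 of 10 attempts scores 1.0 at k=1.
print(pass_at_k(n=10, c=10, k=1))
# A challenge solved in 5 of 10 attempts would score 0.5 at k=1.
print(pass_at_k(n=10, c=5, k=1))
```

Under this reading, a benchmark-wide 100% means every one of the 35 challenges scored 1.0, which requires every sampled attempt to succeed.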
The Three Historic Vulnerabilities
The aggregate numbers are impressive. But only the concrete cases make the scale tangible.
Case 1: OpenBSD – 27-Year TCP SACK Vulnerability
OpenBSD is widely considered one of the most hardened operating systems in existence. It runs on firewalls and critical infrastructure worldwide. Its codebase has been subjected to continuous security auditing for decades.
Mythos Preview found a vulnerability in OpenBSD's TCP SACK implementation that had been present since 1998.
The bug is extraordinarily subtle, involving the interaction of two independent flaws:
Flaw 1: The SACK protocol allows receivers to selectively acknowledge received data packet ranges. OpenBSD's implementation checked only the upper bound of SACK ranges, not the lower bound. This alone is typically harmless.
Flaw 2: Under specific conditions, a null pointer write can be triggered. But under normal circumstances, this code path is unreachable because it requires two mutually exclusive conditions to be satisfied simultaneously.
The breakthrough: TCP sequence numbers are 32-bit values that wrap around, and the kernel orders them by looking at the sign of their 32-bit difference. Mythos Preview discovered that by using Flaw 1 to place the SACK starting point approximately 2^31 away from the normal window, two comparison operations overflow the sign bit at the same time. The kernel is tricked – the "impossible" conditions are both satisfied, and the null pointer write fires.
Impact: Anyone who can connect to the target machine can remotely crash it.
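The sign-bit trick is easier to see with numbers. Below is a minimal sketch of wraparound sequence comparison, modeled on the classic BSD style of testing the sign of a 32-bit difference. It is not the OpenBSD code itself, and the names `to_int32` and `seq_lt` are illustrative assumptions.

```python
def to_int32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed two's-complement value."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x


def seq_lt(a: int, b: int) -> bool:
    """'a comes before b' in sequence space: sign of the 32-bit difference."""
    return to_int32(a - b) < 0


base = 1_000
near = base + 100    # an ordinary offset inside the window
far = base + 2**31   # half the sequence space away

# Ordinary case: exactly one of the two orderings holds.
print(seq_lt(base, near), seq_lt(near, base))
# 2^31 apart: both differences land on the sign bit, so both orderings hold.
print(seq_lt(base, far), seq_lt(far, base))
```

With a distance of exactly 2^31, "a before b" and "b before a" are both true, which is how two conditions that look mutually exclusive can be satisfied together.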
27 years. Countless manual audits and automated scans. Nobody found it. Total cost for the scanning project: under $20,000 – roughly one week's salary for a senior penetration testing engineer.
Case 2: FFmpeg – 16-Year H.264 Decoder Vulnerability
FFmpeg is the most widely used video codec library in the world and one of the most thoroughly fuzzed open-source projects in existence.
Mythos Preview found a vulnerability in the H.264 decoder introduced in 2010, with roots traceable to 2003.
The issue is a seemingly innocent type mismatch: the table recording slice assignments uses 16-bit integers. The slice counter itself is a 32-bit int.
Normal video frames contain only a few slices, so the 16-bit maximum of 65,535 is never reached. The table is initialized using memset(..., -1, ...), which sets every byte to 0xFF and makes 65,535 the sentinel value for "empty slot."
The attack: Construct a video frame containing 65,536 slices. Slice number 65,535 collides with the sentinel value. The decoder misidentifies it as an empty slot, triggering an out-of-bounds write.
In the 16 years since introduction, automated fuzzers executed 5 million runs on this line of code without triggering it. The trigger condition – a frame with exactly 65,536 slices – is astronomically unlikely through random fuzzing but trivial to construct deliberately.
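The collision can be reproduced in a few lines. This is a sketch of the general pattern only (a 16-bit table filled with 0xFF bytes and a wider counter truncated on store), not FFmpeg's actual data structures; `make_table`, `store_slice`, and `is_empty` are hypothetical names.

```python
from array import array

EMPTY = 0xFFFF  # what a 16-bit slot holds after memset(..., -1, ...)


def make_table(entries: int) -> array:
    """Build an unsigned 16-bit table with every byte set to 0xFF."""
    table = array("H")
    table.frombytes(b"\xff" * (table.itemsize * entries))
    return table


def store_slice(table: array, slot: int, slice_num: int) -> None:
    """Store a wider slice counter into a 16-bit slot, truncating it."""
    table[slot] = slice_num & 0xFFFF


def is_empty(table: array, slot: int) -> bool:
    """A slot counts as empty when it still holds the sentinel."""
    return table[slot] == EMPTY


table = make_table(4)
store_slice(table, 0, 3)        # a normal slice number
store_slice(table, 1, 65_535)   # the colliding slice number
print(is_empty(table, 0), is_empty(table, 1))
```

Slot 1 has been written yet still reads as empty, which is the confusion the attack relies on.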
Case 3: FreeBSD NFS – 17-Year Remote Root (CVE-2026-4747)
This is the case that made security researchers' blood run cold.
Mythos Preview fully autonomously discovered and exploited a remote code execution vulnerability in FreeBSD's NFS server that had existed for 17 years.
"Fully autonomous" means that after the initial prompt, no human was involved in any stage of discovery or exploit development.
Impact: An attacker from anywhere on the internet can gain complete root privileges on the target server without any authentication.
The vulnerability: A stack buffer overflow in the NFS server's authentication request handler. Attacker-controlled data is copied directly into a 128-byte stack buffer, but the length check allows up to 400 bytes.
Why existing protections failed: FreeBSD's kernel is compiled with -fstack-protector, but this option only protects functions containing char arrays. This buffer was declared as int32_t[32] – the compiler does not insert a stack canary. FreeBSD also does not implement kernel address space layout randomization.
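The sizes quoted above are consistent with each other, as a quick check with `ctypes` shows. This sketch only does the arithmetic on the stated figures; how much of the overrun is usable in practice is a separate question, and the constant names are illustrative.

```python
import ctypes

# Mirrors the int32_t[32] declaration described above.
Buffer = ctypes.c_int32 * 32

BUFFER_BYTES = ctypes.sizeof(Buffer)  # 32 elements * 4 bytes each
MAX_ACCEPTED = 400                    # upper bound allowed by the length check
OVERRUN_BYTES = MAX_ACCEPTED - BUFFER_BYTES

print(BUFFER_BYTES, OVERRUN_BYTES)
```

Because the buffer is an array of 32-bit integers rather than characters, it is the same 128 bytes of stack either way; the element type only changes whether the plain stack-protector heuristic notices it.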
The exploit: The complete ROP chain exceeds 1,000 bytes, but the stack overflow provides only 200 bytes of space. Mythos Preview's solution: split the attack across 6 consecutive RPC requests. The first 5 write data blocks into kernel memory. The 6th triggers the final payload, appending the attacker's SSH public key to /root/.ssh/authorized_keys.
For comparison, an independent security research firm previously demonstrated that Opus 4.6 could also exploit this same vulnerability – but only with human guidance. Mythos Preview needed none.
Beyond These Three
In addition to these three patched cases, Anthropic's red team blog disclosed SHA-3 hash commitments for a large number of unpatched vulnerabilities spanning every major operating system, every major browser, and multiple cryptographic libraries. Over 99% remain unpatched and cannot be publicly disclosed.
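A hash commitment lets a party prove later that it knew something today without revealing it now. Here is a minimal sketch using Python's `hashlib.sha3_256`; the random nonce and the function names `commit` and `verify` are assumptions for illustration, and the red team blog's exact scheme is not described in this article.

```python
import hashlib
import secrets


def commit(report: bytes) -> tuple[str, bytes]:
    """Return (public digest, secret nonce) for a private report."""
    nonce = secrets.token_bytes(32)  # prevents guessing short or predictable reports
    digest = hashlib.sha3_256(nonce + report).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, report: bytes) -> bool:
    """Check that a later-revealed report matches the earlier commitment."""
    return hashlib.sha3_256(nonce + report).hexdigest() == digest


digest, nonce = commit(b"placeholder vulnerability write-up")
print(digest)  # safe to publish now
print(verify(digest, nonce, b"placeholder vulnerability write-up"))
```

Only the digest is published up front; the nonce and report are revealed after a patch ships, at which point anyone can rerun `verify`.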
Another test: Mythos Preview was given a list of 100 known CVEs and asked to identify the 40 most exploitable and write a privilege escalation exploit for each. Success rate: over 50%.
One exploit started from a 1-bit adjacent physical page write primitive. Through precise kernel memory layout manipulation – slab spraying, page table page alignment, and PTE permission bit flipping – it ultimately rewrote the first page of /usr/bin/passwd with a 168-byte ELF stub calling setuid(0) for root access. Total cost: under $1,000.
The Firefox 147 Comparison
The most striking direct comparison between model generations comes from Anthropic's test of both models against Firefox 147's JavaScript engine:
- Opus 4.6: Hundreds of attempts, 2 working exploits
- Mythos Preview: 250 attempts, 181 working exploits + 29 additional instances of register control
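Expressed as rates, the Mythos Preview figures work out as follows. This sketch uses only the counts listed above; the Opus 4.6 rate is left out because its exact attempt count ("hundreds") is not given.

```python
def rate(successes: int, attempts: int) -> float:
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100 * successes / attempts, 1)


ATTEMPTS = 250
WORKING_EXPLOITS = 181
REGISTER_CONTROL = 29

print(rate(WORKING_EXPLOITS, ATTEMPTS))                     # working exploits only
print(rate(WORKING_EXPLOITS + REGISTER_CONTROL, ATTEMPTS))  # including register control
```

That is 72.4% of attempts ending in a working exploit, and 84.0% reaching at least register control.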
The red team blog put it this way: "Last month, we wrote that Opus 4.6 was far better at finding issues than exploiting them. Internal assessments showed Opus 4.6's success rate at autonomous exploit development was essentially zero. Mythos Preview is an entirely different level."
Pricing and Access
Mythos Preview pricing is set at 5x Opus 4.6 rates:
- Input: $25 per million tokens
- Output: $125 per million tokens
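To translate these rates into a budget, a rough cost estimate is a one-line calculation. The sketch below uses the two listed prices; the token counts in the example call are made up for illustration.

```python
INPUT_USD_PER_MTOK = 25.0    # listed input price per million tokens
OUTPUT_USD_PER_MTOK = 125.0  # listed output price per million tokens


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given number of input and output tokens."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Hypothetical job: 2M input tokens and 500K output tokens.
print(cost_usd(2_000_000, 500_000))
```

Output tokens dominate quickly at these rates: the 500K output tokens in the example cost more than the 2M input tokens.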
Access through four platforms:
- Claude API (direct)
- Amazon Bedrock
- Google Vertex AI
- Microsoft Foundry
The restricted access model reflects both the model's capabilities and the risks documented in the accompanying 244-page System Card.
The Uncomfortable Truth
The red team blog ends with a judgment worth repeating: these capabilities emerged as a downstream result of general improvements in code understanding, reasoning, and autonomy. The same improvements that make AI dramatically better at fixing problems also make it dramatically better at exploiting them.
No specialized training. Pure general intelligence improvement as a side effect.
Cybercrime costs the global economy approximately $500 billion annually. The cybersecurity industry just discovered that its biggest emerging threat arrived as a byproduct of someone solving math problems.
Boris Cherny – creator of Claude Code – offered a succinct assessment: "Mythos is very powerful, and it will frighten people."
Conclusion
Claude Mythos Preview is not an incremental update. It is a generational leap. The benchmark data shows this across every dimension – coding, reasoning, mathematics, agent tasks, long context, cybersecurity.
The question is no longer whether these capabilities are real. The question is what happens next – and whether defenders are fast enough to leverage the tools Anthropic is putting in their hands through Project Glasswing.








