Claude Mythos Preview: Benchmarks, Exploit Chains, and the Technical Deep Dive

April 11, 2026 · 7 min read

TL;DR: "Mythos Preview more than doubles Opus 4.6 on SWE-bench Multimodal, solves all 35 CTF challenges on the first attempt, and autonomously developed a 6-stage ROP chain exploit for remote root on FreeBSD. Cost for the OpenBSD scan: under $20,000."

— Till Freitag

The Benchmark Data in Detail

On April 7, 2026, Anthropic published the complete benchmark data for Claude Mythos Preview. The numbers confirm what the March leak hinted at – and exceed it.

Coding Benchmarks

Benchmark	Mythos Preview	Opus 4.6	GPT-5.4
SWE-bench Verified	93.9%	80.8%	—
SWE-bench Pro	77.8%	53.4%	57.7%
Terminal-Bench 2.0	82.0%	65.4%	—
SWE-bench Multimodal	59.0%	27.1%	—

SWE-bench Verified: 13.1 percentage point lead. SWE-bench Pro: 24.4 points. Most dramatic is SWE-bench Multimodal – Mythos more than doubles Opus 4.6's score.
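The gaps quoted above can be recomputed directly from the table. The following is a minimal sketch: the scores are copied from the table, while the dictionary layout and the helper names `point_gap` and `ratio` are illustrative choices, not part of any published tooling.

```python
# Scores in percent, copied from the coding benchmark table:
# (Mythos Preview, Opus 4.6)
SCORES = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
    "Terminal-Bench 2.0": (82.0, 65.4),
    "SWE-bench Multimodal": (59.0, 27.1),
}


def point_gap(mythos, opus):
    """Absolute lead in percentage points, rounded to one decimal."""
    return round(mythos - opus, 1)


def ratio(mythos, opus):
    """Relative factor between the two scores, rounded to two decimals."""
    return round(mythos / opus, 2)


for name, (mythos, opus) in SCORES.items():
    print(f"{name}: +{point_gap(mythos, opus)} points, {ratio(mythos, opus)}x")
```

Keeping the two measures apart matters here: a lead in percentage points and a relative factor tell different stories, and only SWE-bench Multimodal crosses the 2x mark.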

Reasoning and Mathematics

Benchmark	Mythos Preview	Opus 4.6	GPT-5.4
GPQA Diamond	94.6%	—	—
HLE (with tools)	64.7%	53.1%	—
USAMO 2026	97.6%	42.3%	—

The USAMO results are the most striking gap in the entire dataset: 97.6% vs. 42.3% – a 55.3 percentage point difference on a competitive mathematics exam.

Agentic Tasks and Long Context

Benchmark	Mythos Preview	Opus 4.6	GPT-5.4
OSWorld	79.6%	—	—
BrowseComp	86.9%	—	—
GraphWalks (256K–1M tokens)	80.0%	38.7%	21.4%

GraphWalks tests reasoning over extremely long contexts from 256K to 1 million tokens. Mythos scores 80.0%, roughly 3.7x GPT-5.4's 21.4% and about twice Opus 4.6's 38.7%.

Cybersecurity

Benchmark	Mythos Preview	Opus 4.6
CyberGym	83.1%	66.6%
Cybench (pass@1, 10 attempts)	100%	—

Cybench comprises 35 CTF challenges. Mythos Preview was given 10 independent attempts at each and scored 100% pass@1, which means every attempt on every challenge succeeded.
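For readers unfamiliar with the metric, here is a rough sketch of how a pass@1 figure over 10 attempts can be computed. It assumes the widely used unbiased pass@k estimator; whether Cybench's harness uses exactly this formula is an assumption, and the per-challenge tallies below are hypothetical because only the aggregate score is reported.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: attempts made, c: successful attempts, k: sample budget.
    Returns the probability that at least one of k attempts drawn
    without replacement from the n recorded attempts is a success.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(successes_per_task, n, k=1):
    """Average pass@k across all tasks."""
    total = sum(pass_at_k(n, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical tallies: 35 challenges, each solved in all 10 attempts.
all_solved = [10] * 35
print(benchmark_score(all_solved, n=10))

# Hypothetical contrast: one challenge solved in only 5 of 10 attempts.
one_flaky = [10] * 34 + [5]
print(benchmark_score(one_flaky, n=10))
```

With k=1 the estimator reduces to the plain success fraction c/n, so a 100% score leaves no room for even a single failed attempt.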

The Three Historic Vulnerabilities

The aggregate numbers are impressive. But only the concrete cases make the scale tangible.

Case 1: OpenBSD – 27-Year TCP SACK Vulnerability

OpenBSD is widely considered one of the most hardened operating systems in existence. It runs on firewalls and critical infrastructure worldwide. Its codebase has been subjected to continuous security auditing for decades.

Mythos Preview found a vulnerability in OpenBSD's TCP SACK implementation that had been present since 1998.

The bug is extraordinarily subtle, involving the interaction of two independent flaws:

Flaw 1: SACK (selective acknowledgment) lets a receiver acknowledge specific, non-contiguous ranges of received data. OpenBSD's implementation checked only the upper bound of SACK ranges, not the lower bound. This alone is typically harmless.

Flaw 2: Under specific conditions, a null pointer write can be triggered. But under normal circumstances, this code path is unreachable because it requires two mutually exclusive conditions to be satisfied simultaneously.

The breakthrough: TCP sequence numbers are 32-bit values that wrap around, and the kernel orders them by interpreting the difference between two sequence numbers as a signed 32-bit integer. Mythos Preview discovered that by using Flaw 1 to set the SACK starting point approximately 2^31 away from the normal window, two comparison operations simultaneously overflow the sign bit. The kernel is tricked: the "impossible" conditions are both satisfied, and the null pointer write fires.
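The wraparound effect is easier to see in a few lines of arithmetic. TCP stacks in the BSD tradition order sequence numbers by taking the 32-bit difference and reading it as a signed value. The sketch below models only that comparison; the helper names and sample values are illustrative, and the actual OpenBSD code path is not reproduced here.

```python
MASK32 = 0xFFFFFFFF


def to_int32(x):
    """Interpret the low 32 bits of x as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x >= (1 << 31) else x


def seq_lt(a, b):
    """BSD-style ordering: a is 'before' b if (int32)(a - b) is negative."""
    return to_int32(a - b) < 0


window_start = 1_000_000                      # arbitrary sample sequence number
nearby = window_start + 5_000                 # inside a plausible window
far = (window_start + (1 << 31)) & MASK32     # exactly 2^31 away

# For nearby values the ordering is consistent: one direction only.
print(seq_lt(window_start, nearby), seq_lt(nearby, window_start))

# At a distance of 2^31 the difference is INT32_MIN in both directions,
# so each value appears to be "before" the other.
print(seq_lt(window_start, far), seq_lt(far, window_start))
```

This is the general shape of the trick: once an unchecked value sits half the sequence space away, two checks that a programmer would consider mutually exclusive can both evaluate to true.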

Impact: Anyone who can connect to the target machine can remotely crash it.

27 years. Countless manual audits and automated scans. Nobody found it. Total cost for the scanning project: under $20,000, a small fraction of a senior penetration testing engineer's annual salary.

Case 2: FFmpeg – 16-Year H.264 Decoder Vulnerability

FFmpeg is the most widely used video codec library in the world and one of the most thoroughly fuzzed open-source projects in existence.

Mythos Preview found a vulnerability in the H.264 decoder introduced in 2010, with roots traceable to 2003.

The issue is a seemingly innocent type mismatch: the table recording slice assignments uses 16-bit integers. The slice counter itself is a 32-bit int.

Normal video frames contain only a few slices, so the counter never approaches the 16-bit maximum of 65,535. The table is initialized using memset(..., -1, ...), which sets every byte to 0xFF and makes 65,535 (0xFFFF) the sentinel value for "empty slot."

The attack: Construct a video frame containing 65,536 slices. Slice number 65,535 collides with the sentinel value. The decoder misidentifies it as an empty slot, triggering an out-of-bounds write.
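The collision can be modeled in a few lines. This is a simplified sketch of the type mismatch, not FFmpeg's actual code: the table is a plain list, and the names `record_slice` and `is_empty` are hypothetical.

```python
# memset(..., -1, ...) fills every byte with 0xFF, so each 16-bit slot reads 0xFFFF.
EMPTY = int.from_bytes(b"\xff\xff", "little")


def make_table(n_slots):
    """Create a table whose slots all hold the 'empty' sentinel."""
    return [EMPTY] * n_slots


def record_slice(table, slot, slice_num):
    """Store a 32-bit slice counter in a 16-bit slot.

    The mask models the truncation a C assignment to uint16_t performs.
    """
    table[slot] = slice_num & 0xFFFF


def is_empty(table, slot):
    """A slot counts as empty if it still holds the sentinel value."""
    return table[slot] == EMPTY


table = make_table(4)
record_slice(table, 0, 7)         # an ordinary slice number
record_slice(table, 1, 65_535)    # the 65,536th slice, counting from 0

print(is_empty(table, 0))  # slot 0 was written and is seen as occupied
print(is_empty(table, 1))  # slot 1 was written but looks untouched
print(is_empty(table, 2))  # slot 2 was never written
```

After the write, slot 1 is indistinguishable from slot 2, which is the misidentification the attack relies on.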

In the 16 years since introduction, automated fuzzers executed 5 million runs on this line of code without triggering it. The trigger condition – a frame with exactly 65,536 slices – is astronomically unlikely through random fuzzing but trivial to construct deliberately.

Case 3: FreeBSD NFS – 17-Year Remote Root (CVE-2026-4747)

This is the case that made security researchers' blood run cold.

Mythos Preview fully autonomously discovered and exploited a remote code execution vulnerability in FreeBSD's NFS server that had existed for 17 years.

"Fully autonomous" means that after the initial prompt, no human was involved in any stage of discovery or exploit development.

Impact: An attacker from anywhere on the internet can gain complete root privileges on the target server without any authentication.

The vulnerability: A stack buffer overflow in the NFS server's authentication request handler. Attacker-controlled data is copied directly into a 128-byte stack buffer, but the length check allows up to 400 bytes.

Why existing protections failed: FreeBSD's kernel is compiled with -fstack-protector, but this option only protects functions containing char arrays. This buffer was declared as int32_t[32] – the compiler does not insert a stack canary. FreeBSD also does not implement kernel address space layout randomization.
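The sizes involved are worth a quick sanity check. The sketch below uses Python's `ctypes` to mirror the `int32_t[32]` declaration and compares it with the 400-byte limit quoted above; the variable names are illustrative, and the result is only an upper bound on how far a copy can run past the buffer, not a statement about the real stack layout.

```python
import ctypes

# Mirrors the int32_t[32] declaration described in the text.
BufferType = ctypes.c_int32 * 32

buffer_bytes = ctypes.sizeof(BufferType)   # size of the stack buffer
max_copy_bytes = 400                       # length the check permits, per the text

# Upper bound on bytes written beyond the end of the buffer.
overflow_bytes = max_copy_bytes - buffer_bytes

print(buffer_bytes, overflow_bytes)
```

The declaration also explains the missing canary: the buffer holds 32-bit integers rather than characters, so a heuristic keyed on char arrays skips the function.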

The exploit: The complete ROP chain exceeds 1,000 bytes, but the stack overflow provides only 200 bytes of space. Mythos Preview's solution: split the attack across 6 consecutive RPC requests. The first 5 write data blocks into kernel memory. The 6th triggers the final payload, appending the attacker's SSH public key to /root/.ssh/authorized_keys.

For comparison, an independent security research firm previously demonstrated that Opus 4.6 could also exploit this same vulnerability – but only with human guidance. Mythos Preview needed none.

Beyond These Three

In addition to these three patched cases, Anthropic's red team blog disclosed SHA-3 hash commitments for a large number of unpatched vulnerabilities spanning every major operating system, every major browser, and multiple cryptographic libraries. Over 99% remain unpatched and cannot be publicly disclosed.
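A hash commitment lets a researcher prove later that a finding existed on a given date without revealing it today. The sketch below shows the general idea with SHA3-256 from Python's standard library. The exact scheme and input format used in the red team blog are not described here, so the nonce handling and the function names `commit` and `verify` are assumptions.

```python
import hashlib
import hmac
import secrets


def commit(report: bytes):
    """Return (digest, nonce). Publish the digest now; keep nonce and report private."""
    # The random nonce stops anyone from guessing short or predictable reports.
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha3_256(nonce + report).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, report: bytes) -> bool:
    """Check a revealed report and nonce against the published digest."""
    expected = hashlib.sha3_256(nonce + report).hexdigest()
    return hmac.compare_digest(expected, digest)


# Hypothetical report text, used only to exercise the functions.
report = b"placeholder vulnerability write-up"
digest, nonce = commit(report)

print(digest)
print(verify(digest, nonce, report))
print(verify(digest, nonce, b"a different write-up"))
```

Once the vendor has shipped a patch, the report and nonce can be published and anyone can recompute the digest to confirm the original claim.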

Another test: Mythos Preview was given a list of 100 known CVEs and asked to identify the 40 most exploitable and write a privilege escalation exploit for each. Success rate: over 50%.

One exploit started from a 1-bit adjacent physical page write primitive. Through precise kernel memory layout manipulation – slab spraying, page table page alignment, and PTE permission bit flipping – it ultimately rewrote the first page of /usr/bin/passwd with a 168-byte ELF stub calling setuid(0) for root access. Total cost: under $1,000.

The Firefox 147 Comparison

The most striking direct comparison between model generations comes from Anthropic's test of both models against Firefox 147's JavaScript engine:

Opus 4.6: Hundreds of attempts, 2 working exploits
Mythos Preview: 250 attempts, 181 working exploits + 29 additional instances of register control
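Expressed as rates, the Mythos Preview figures look like this. The sketch uses only the counts listed above; no rate is computed for Opus 4.6 because its attempt count is given only as "hundreds".

```python
attempts = 250
working_exploits = 181
register_control = 29   # attempts that reached register control without a full exploit

# Share of attempts that produced a working exploit.
exploit_rate = working_exploits / attempts

# Share of attempts that reached at least register control.
control_or_better_rate = (working_exploits + register_control) / attempts

print(f"{exploit_rate:.1%}")
print(f"{control_or_better_rate:.1%}")
```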

The red team blog put it this way: "Last month, we wrote that Opus 4.6 was far better at finding issues than exploiting them. Internal assessments showed Opus 4.6's success rate at autonomous exploit development was essentially zero. Mythos Preview is an entirely different level."

Pricing and Access

Mythos Preview pricing is set at 5x Opus 4.6 rates:

Input: $25 per million tokens
Output: $125 per million tokens
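To get a feel for what these rates mean per request, here is a small cost estimator. The two rates come from the list above; the function name and the token counts in the usage line are made up for illustration, and real invoices may differ if caching or batch discounts apply.

```python
INPUT_USD_PER_MTOK = 25.0     # USD per million input tokens, as listed above
OUTPUT_USD_PER_MTOK = 125.0   # USD per million output tokens, as listed above


def estimate_cost(input_tokens, output_tokens):
    """Estimated cost in USD for one request at list prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: a 200K-token codebase excerpt and an 8K-token answer.
print(estimate_cost(200_000, 8_000))
```

Because output tokens cost five times as much as input tokens, long generated reports dominate the bill faster than large prompts do.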

Access through four platforms:

Claude API (direct)
Amazon Bedrock
Google Vertex AI
Microsoft Foundry

The restricted access model reflects both the model's capabilities and the risks documented in the accompanying 244-page System Card.

The Uncomfortable Truth

The red team blog ends with a judgment worth repeating: these capabilities emerged as a downstream result of general improvements in code understanding, reasoning, and autonomy. The same improvements that make AI dramatically better at fixing problems also make it dramatically better at exploiting them.

No specialized security training was involved. The capability is a side effect of general intelligence improvements.

Cybercrime costs the global economy approximately $500 billion annually. The industry that defends against it just discovered that its biggest emerging threat arrived as a byproduct of someone solving math problems.

Boris Cherny – creator of Claude Code – offered a succinct assessment: "Mythos is very powerful, and it will frighten people."

Conclusion

Claude Mythos Preview is not an incremental update. It is a generational leap. The benchmark data shows this across every dimension – coding, reasoning, mathematics, agent tasks, long context, cybersecurity.

The question is no longer whether these capabilities are real. The question is what happens next – and whether defenders are fast enough to leverage the tools Anthropic is putting in their hands through Project Glasswing.