Claude Mythos Preview: Benchmarks, Exploit Chains, and the Technical Deep Dive

    Till Freitag · April 11, 2026 · 7 min read

    TL;DR: "Mythos Preview beats Opus 4.6 by 24.4 points on SWE-bench Pro, solves all 35 CTF challenges on the first attempt, and autonomously developed a 6-stage ROP chain exploit for remote root on FreeBSD. Cost for the OpenBSD scan: under $20,000."

    — Till Freitag

    The Benchmark Data in Detail

    On April 7, 2026, Anthropic published the complete benchmark data for Claude Mythos Preview. The numbers confirm what the March leak hinted at – and exceed it.

    Coding Benchmarks

    Benchmark              Mythos Preview   Opus 4.6   GPT-5.4
    SWE-bench Verified     93.9%            80.8%      –
    SWE-bench Pro          77.8%            53.4%      57.7%
    Terminal-Bench 2.0     82.0%            65.4%      –
    SWE-bench Multimodal   59.0%            27.1%      –

    On SWE-bench Verified, Mythos Preview leads Opus 4.6 by 13.1 percentage points; on SWE-bench Pro, by 24.4 points. Most dramatic is SWE-bench Multimodal, where Mythos Preview more than doubles Opus 4.6's score.
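The gaps quoted above can be re-derived from the table. The following is a minimal sketch, assuming the second score on each row belongs to Opus 4.6 (as the paragraph above states for these three benchmarks); the `compare` helper is a hypothetical name used only for illustration.

```python
# Scores in percent, copied from the coding benchmark table:
# (Mythos Preview, Opus 4.6).
scores = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
    "SWE-bench Multimodal": (59.0, 27.1),
}


def compare(mythos: float, opus: float) -> tuple[float, float]:
    """Return (percentage-point lead, score ratio) of Mythos Preview over Opus 4.6."""
    return round(mythos - opus, 1), round(mythos / opus, 2)


gaps = {name: compare(*pair) for name, pair in scores.items()}

for name, (lead, ratio) in gaps.items():
    print(f"{name}: +{lead} points, {ratio}x")
```

Keeping the point lead and the ratio apart matters here: only SWE-bench Multimodal crosses a ratio of 2, even though SWE-bench Pro has the larger absolute lead.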

    Reasoning and Mathematics

    Benchmark          Mythos Preview   Opus 4.6   GPT-5.4
    GPQA Diamond       94.6%            –          –
    HLE (with tools)   64.7%            53.1%      –
    USAMO 2026         97.6%            42.3%      –

    The USAMO results are the most striking gap in the entire dataset: 97.6% vs. 42.3% – a 55.3 percentage point difference on a competitive mathematics exam.

    Agentic Tasks and Long Context

    Benchmark                     Mythos Preview   Opus 4.6   GPT-5.4
    OSWorld                       79.6%            –          –
    BrowseComp                    86.9%            –          –
    GraphWalks (256K–1M tokens)   80.0%            38.7%      21.4%

    GraphWalks tests reasoning over extremely long contexts from 256K to 1 million tokens. Mythos scores 80.0%, roughly 3.7 times GPT-5.4's 21.4% and about twice Opus 4.6's 38.7%.

    Cybersecurity

    Benchmark                       Mythos Preview   Opus 4.6
    CyberGym                        83.1%            66.6%
    Cybench (pass@1, 10 attempts)   100%             –

    Cybench comprises 35 CTF challenges. Mythos Preview solved every single one on the first attempt across 10 runs.

    The Three Historic Vulnerabilities

    The aggregate numbers are impressive. But only the concrete cases make the scale tangible.

    Case 1: OpenBSD – 27-Year TCP SACK Vulnerability

    OpenBSD is widely considered one of the most hardened operating systems in existence. It runs on firewalls and critical infrastructure worldwide. Its codebase has been subjected to continuous security auditing for decades.

    Mythos Preview found a vulnerability in OpenBSD's TCP SACK implementation that had been present since 1998.

    The bug is extraordinarily subtle, involving the interaction of two independent flaws:

    Flaw 1: The SACK protocol allows receivers to selectively acknowledge received data packet ranges. OpenBSD's implementation checked only the upper bound of SACK ranges, not the lower bound. This alone is typically harmless.

    Flaw 2: Under specific conditions, a null pointer write can be triggered. But under normal circumstances, this code path is unreachable because it requires two mutually exclusive conditions to be satisfied simultaneously.

    The breakthrough: TCP sequence numbers are unsigned 32-bit values that wrap around, so the kernel compares them by interpreting their difference as a signed 32-bit integer. Mythos Preview discovered that by using Flaw 1 to set the SACK starting point approximately 2^31 away from the normal window, two comparison operations overflow into the sign bit simultaneously. The kernel is tricked: the "impossible" conditions are both satisfied, and the null pointer write fires.
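To see how two normally exclusive comparisons can both hold, here is a minimal sketch of wraparound-aware sequence comparison. It assumes comparisons in the style of the `SEQ_LT`/`SEQ_GT` macros found in BSD-derived TCP stacks, which test the sign of the 32-bit difference. It is not the OpenBSD code path, and the window values are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_int32(x: int) -> int:
    """Reinterpret the low 32 bits of x as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def seq_lt(a: int, b: int) -> bool:
    """Wraparound-aware 'a before b': the signed 32-bit difference is negative."""
    return to_int32(a - b) < 0


def seq_gt(a: int, b: int) -> bool:
    """Wraparound-aware 'a after b': the signed 32-bit difference is positive."""
    return to_int32(a - b) > 0


# Hypothetical window for illustration only.
window_start, window_end = 1000, 2000

inside = 1500
far_away = (window_start + 2**31 + 500) & MASK32  # roughly 2^31 from the window

for name, seq in (("inside", inside), ("far_away", far_away)):
    print(name, seq_lt(seq, window_start), seq_gt(seq, window_end))
```

For any value near the window, "before the start" and "after the end" cannot both be true. For a value about 2^31 away, the two differences land on opposite sides of the sign bit, so both comparisons report true at once.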

    Impact: Anyone who can connect to the target machine can remotely crash it.

    27 years. Countless manual audits and automated scans. Nobody found it. Total cost for the scanning project: under $20,000 – roughly one week's salary for a senior penetration testing engineer.

    Case 2: FFmpeg – 16-Year H.264 Decoder Vulnerability

    FFmpeg is the most widely used video codec library in the world and one of the most thoroughly fuzzed open-source projects in existence.

    Mythos Preview found a vulnerability in the H.264 decoder introduced in 2010, with roots traceable to 2003.

    The issue is a seemingly innocent type mismatch: the table recording slice assignments uses 16-bit integers. The slice counter itself is a 32-bit int.

    Normal video frames contain only a few slices, so the 16-bit maximum of 65,535 is never reached. The table is initialized using memset(..., -1, ...), which sets every byte to 0xFF and makes 65,535 (0xFFFF) the sentinel value for "empty slot."

    The attack: Construct a video frame containing 65,536 slices. Slice number 65,535 collides with the sentinel value. The decoder misidentifies it as an empty slot, triggering an out-of-bounds write.
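The collision can be reproduced in a few lines. This is a minimal sketch of the type mismatch only, not FFmpeg's code: the table size and the helper names `stored_slice_id` and `looks_empty` are hypothetical, and the 16-bit entries are assumed to be unsigned, consistent with the 65,535 sentinel described above.

```python
import ctypes

TABLE_SIZE = 8  # hypothetical; tiny on purpose

# memset(table, -1, ...) writes 0xFF into every byte,
# so each 16-bit entry reads back as 0xFFFF.
table = (ctypes.c_uint16 * TABLE_SIZE)()
ctypes.memset(table, -1, ctypes.sizeof(table))
SENTINEL = table[0]


def stored_slice_id(slice_counter: int) -> int:
    """Value a 16-bit table entry holds after storing a wider counter."""
    return ctypes.c_uint16(slice_counter).value


def looks_empty(slice_counter: int) -> bool:
    """True when a stored slice number is indistinguishable from an empty slot."""
    return stored_slice_id(slice_counter) == SENTINEL


print(hex(SENTINEL), looks_empty(3), looks_empty(65535))
```

Every realistic slice number is distinguishable from the sentinel; only slice number 65,535 is not, which is why ordinary inputs never exercise the faulty path.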

    In the 16 years since introduction, automated fuzzers executed 5 million runs on this line of code without triggering it. The trigger condition – a frame with exactly 65,536 slices – is astronomically unlikely through random fuzzing but trivial to construct deliberately.

    Case 3: FreeBSD NFS – 17-Year Remote Root (CVE-2026-4747)

    This is the case that made security researchers' blood run cold.

    Mythos Preview fully autonomously discovered and exploited a remote code execution vulnerability in FreeBSD's NFS server that had existed for 17 years.

    "Fully autonomous" means that after the initial prompt, no human was involved in any stage of discovery or exploit development.

    Impact: An attacker from anywhere on the internet can gain complete root privileges on the target server without any authentication.

    The vulnerability: A stack buffer overflow in the NFS server's authentication request handler. Attacker-controlled data is copied directly into a 128-byte stack buffer, but the length check allows up to 400 bytes.

    Why existing protections failed: FreeBSD's kernel is compiled with -fstack-protector, but this option only protects functions containing char arrays. This buffer was declared as int32_t[32] – the compiler does not insert a stack canary. FreeBSD also does not implement kernel address space layout randomization.
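The size mismatch is easy to check numerically. The sketch below only restates the figures given above (an `int32_t[32]` buffer and a 400-byte limit); `overrun_bytes` is a hypothetical helper, not a function from FreeBSD.

```python
import ctypes

# int32_t[32], as declared in the vulnerable handler per the description above.
StackBuffer = ctypes.c_int32 * 32
BUFFER_BYTES = ctypes.sizeof(StackBuffer)
MAX_ALLOWED_COPY = 400  # upper limit accepted by the length check


def overrun_bytes(copy_len: int) -> int:
    """Bytes written past the buffer for a copy length the check would accept."""
    if copy_len > MAX_ALLOWED_COPY:
        raise ValueError("rejected by the length check")
    return max(0, copy_len - BUFFER_BYTES)


print(BUFFER_BYTES, overrun_bytes(128), overrun_bytes(MAX_ALLOWED_COPY))
```

Any accepted length above 128 bytes writes beyond the array, and because the array's element type is not `char`, the stack protector described above never places a canary in the way.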

    The exploit: The complete ROP chain exceeds 1,000 bytes, but the stack overflow provides only 200 bytes of space. Mythos Preview's solution: split the attack across 6 consecutive RPC requests. The first 5 write data blocks into kernel memory. The 6th triggers the final payload, appending the attacker's SSH public key to /root/.ssh/authorized_keys.

    For comparison, an independent security research firm previously demonstrated that Opus 4.6 could also exploit this same vulnerability – but only with human guidance. Mythos Preview needed none.

    Beyond These Three

    In addition to these three patched cases, Anthropic's red team blog disclosed SHA-3 hash commitments for a large number of unpatched vulnerabilities spanning every major operating system, every major browser, and multiple cryptographic libraries. Over 99% remain unpatched and cannot be publicly disclosed.
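A hash commitment lets a publisher prove later that a report existed at publication time without revealing it now. The following is a minimal sketch of the general commit-and-reveal idea, not Anthropic's actual procedure: the choice of SHA3-256, the random nonce, and the `commit`/`verify` names are all assumptions.

```python
import hashlib
import hmac
import secrets


def commit(report: bytes) -> tuple[str, bytes]:
    """Return (public digest, secret nonce) for a report kept private for now."""
    # The nonce stops anyone from confirming guesses about short reports.
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha3_256(nonce + report).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, report: bytes) -> bool:
    """Check a revealed (nonce, report) pair against the published digest."""
    recomputed = hashlib.sha3_256(nonce + report).hexdigest()
    return hmac.compare_digest(recomputed, digest)


digest, nonce = commit(b"hypothetical vulnerability write-up")
print(digest)
```

Only `digest` is published. Once a patch ships, revealing the nonce and the report lets anyone recompute the hash and confirm the finding predates the fix.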

    Another test: Mythos Preview was given a list of 100 known CVEs and asked to identify the 40 most exploitable and write a privilege escalation exploit for each. Success rate: over 50%.

    One exploit started from a 1-bit adjacent physical page write primitive. Through precise kernel memory layout manipulation – slab spraying, page table page alignment, and PTE permission bit flipping – it ultimately rewrote the first page of /usr/bin/passwd with a 168-byte ELF stub calling setuid(0) for root access. Total cost: under $1,000.
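To make "permission bit flipping" concrete, the sketch below shows how a single bit in a page table entry separates a read-only mapping from a writable one. It assumes the x86-64 flag layout (bit 0 present, bit 1 writable, bit 2 user-accessible), which the source does not specify, and the frame number is hypothetical. It models only the bit arithmetic.

```python
# Low flag bits of an x86-64 page table entry (assumed architecture).
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_USER = 1 << 2


def flip_bit(value: int, bit: int) -> int:
    """Toggle one bit, which is all a 1-bit write primitive can do."""
    return value ^ (1 << bit)


def describe(pte: int) -> dict[str, bool]:
    """Decode the permission flags of a page table entry."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
    }


frame = 0x1234 << 12  # hypothetical physical frame address
read_only_pte = frame | PTE_PRESENT | PTE_USER
flipped_pte = flip_bit(read_only_pte, 1)

print(describe(read_only_pte))
print(describe(flipped_pte))
```

The two entries differ in exactly one bit, yet the second one maps the same page as writable. The hard part described above is arranging memory so that the one controllable bit lands on such an entry.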

    The Firefox 147 Comparison

    The most striking direct comparison between model generations: Anthropic tested Opus 4.6 and Mythos Preview against Firefox 147's JavaScript engine:

    • Opus 4.6: Hundreds of attempts, 2 working exploits
    • Mythos Preview: 250 attempts, 181 working exploits + 29 additional instances of register control
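Expressed as rates, the Mythos Preview figures look like this. It is a small arithmetic sketch using only the counts listed above; the Opus 4.6 rate is left out because its attempt count is given only as "hundreds".

```python
MYTHOS_ATTEMPTS = 250
MYTHOS_WORKING_EXPLOITS = 181
MYTHOS_REGISTER_CONTROL = 29  # additional runs that reached register control


def rate(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded."""
    return successes / attempts


working_rate = rate(MYTHOS_WORKING_EXPLOITS, MYTHOS_ATTEMPTS)
combined_rate = rate(
    MYTHOS_WORKING_EXPLOITS + MYTHOS_REGISTER_CONTROL, MYTHOS_ATTEMPTS
)

print(f"working exploits: {working_rate:.1%}")
print(f"working or register control: {combined_rate:.1%}")
```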

    The red team blog put it this way: "Last month, we wrote that Opus 4.6 was far better at finding issues than exploiting them. Internal assessments showed Opus 4.6's success rate at autonomous exploit development was essentially zero. Mythos Preview is an entirely different level."

    Pricing and Access

    Mythos Preview pricing is set at 5x Opus 4.6 rates:

    • Input: $25 per million tokens
    • Output: $125 per million tokens
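A quick way to budget against these rates is shown below. It is a minimal sketch: the rates come from the list above, while `estimate_cost` and the token counts in the example are hypothetical.

```python
from decimal import Decimal

INPUT_USD_PER_MTOK = Decimal("25")
OUTPUT_USD_PER_MTOK = Decimal("125")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated USD cost of a workload at the listed Mythos Preview rates."""
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_MTOK
        + Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    )
    return total / TOKENS_PER_MTOK


# Opus 4.6 rates implied by the stated 5x multiplier.
implied_opus_input = INPUT_USD_PER_MTOK / 5
implied_opus_output = OUTPUT_USD_PER_MTOK / 5

example = estimate_cost(input_tokens=2_000_000, output_tokens=500_000)
print(f"${example}")
```

`Decimal` is used instead of floats so that dollar amounts stay exact when many small requests are summed.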

    Access through four platforms:

    • Claude API (direct)
    • Amazon Bedrock
    • Google Vertex AI
    • Microsoft Foundry

    The restricted access model reflects both the model's capabilities and the risks documented in the accompanying 244-page System Card.

    The Uncomfortable Truth

    The red team blog ends with a judgment worth repeating: these capabilities emerged as a downstream result of general improvements in code understanding, reasoning, and autonomy. The same improvements that make AI dramatically better at fixing problems also make it dramatically better at exploiting them.

    No specialized training. Pure general intelligence improvement as a side effect.

    Cybercrime costs the global economy approximately $500 billion annually. The industry that defends against it just discovered that its biggest emerging threat arrived as a byproduct of someone solving math problems.

    Boris Cherny – creator of Claude Code – offered a succinct assessment: "Mythos is very powerful, and it will frighten people."

    Conclusion

    Claude Mythos Preview is not an incremental update. It is a generational leap. The benchmark data shows this across every dimension – coding, reasoning, mathematics, agent tasks, long context, cybersecurity.

    The question is no longer whether these capabilities are real. The question is what happens next – and whether defenders are fast enough to leverage the tools Anthropic is putting in their hands through Project Glasswing.
