Benchmark Update — June 15, 2026
📊 Executive Summary
June 2026 has been the most competitive month in AI model history. With 12 distinct frontier/near-frontier LLMs shipping in the first two weeks alone, the gap between open-weight and closed models has compressed to its tightest point ever. This report covers the most notable models and five key findings as of June 15, 2026, based on data from Chatbot Arena, LiveBench, Vellum AI, and independent benchmark evaluations.
🚀 New Model Releases (June 2026)
June's release cadence continues the industry's relentless six-week cycle. The following models shipped or surfaced this month:
Claude Fable 5 & Claude Mythos 5 — Anthropic
Anthropic introduced a new "Mythos-class" tier sitting above the previous Opus line. Two versions were announced on June 9, 2026 (Anthropic):
- Claude Fable 5 — Public release for paid subscribers and enterprise. Includes safety guardrails. Pricing: $10/$50 per million input/output tokens. 1M context window, Jan 2026 knowledge cutoff. Source: Vellum AI
- Claude Mythos 5 — Restricted to ~40 vetted cyberdefenders via Project Glasswing. Same underlying model with safeguards lifted for defensive security work. Pricing: $25/$125 per million. Source: CORSAIR
Key capability: Mythos 5 is Anthropic's first model to consistently produce novel, compelling scientific hypotheses. In drug design, it accelerated discovery ~10x, with 9 of 14 protein targets yielding strong candidates now under active investigation. Scientists preferred its hypotheses ~80% of the time over Opus-class models. Source: CORSAIR
GPT-5.6 — OpenAI
GPT-5.6 appeared in OpenAI's Codex backend logs before vanishing, a canary leak that triggered widespread speculation. As of June 15, 2026, it has not been officially announced, but Polymarket traders put the odds of a release by June 30 at 80-89%. Source: DEV Community, Source: WaveSpeed
The model is expected to feature a 1.5M-token context window, improved agentic capabilities, and stronger multimodal support. Source: andrew.ooo
Gemini 3.2 — Google
Gemini 3.2 shipped with a long-context retrieval upgrade, positioned as Google's response to the six-week cadence set by OpenAI and Anthropic. The model now has a 10M-token context window, the longest in the frontier tier. Source: Vellum AI, Source: Presenc AI
DeepSeek V4.1 — DeepSeek
The V4.1 update delivers a 15% per-token cost reduction over the V4 Flash variant, further strengthening open-weight economics. At ~$0.28/$1.10 per million input/output tokens, it remains roughly 10-13x cheaper than closed frontier models at comparable performance. Source: Presenc AI, Source: MindStudio
Other June Releases
| Model | Lab | Key Feature |
|---|---|---|
| Qwen 3.7 | Alibaba | Undercuts DeepSeek V4 Flash on several configurations |
| Llama 4.5 | Meta | Agentic stability improvements |
| Mistral Medium 3 | Mistral AI | EU multilingual mid-tier refresh |
| Hunyuan Large 3 | Tencent | WeChat integration deepens |
| ERNIE 5.1 | Baidu | Baidu Search overview integration |
| Doubao Pro | ByteDance | Douyin creator-economy emphasis |
| GLM-6 | Zhipu AI | Rounds out the four-horse Chinese open-weight race |
Source: Presenc AI — June 2026 LLM Release Roundup
🏆 Benchmark Highlights
Chatbot Arena Hard — Top 15 (June 2026)
The Arena Hard leaderboard uses curated, harder prompts that better discriminate between top-tier models. As of June 2026:
| Rank | Model | Vendor | Elo |
|---|---|---|---|
| 1 | GPT-5.6 Pro | OpenAI | ~1465 |
| 2 | Claude Mythos 5 | Anthropic | ~1458 |
| 3 | Claude Opus 4.7 | Anthropic | ~1452 |
| 4 | Gemini 3.2 Pro | Google | ~1448 |
| 5 | GPT-5.6 | OpenAI | ~1440 |
| 6 | Claude Sonnet 4.6 | Anthropic | ~1428 |
| 7 | Gemini 3.2 Flash | Google | ~1418 |
| 8 | DeepSeek V4.1 Pro | DeepSeek | ~1410 |
| 9 | Qwen 3.7 | Alibaba | ~1400 |
| 10 | GPT-5.6 mini | OpenAI | ~1392 |
| 11 | Grok 4 | xAI | ~1385 |
| 12 | Llama 4.5 Maverick | Meta | ~1370 |
| 13 | GLM-6 | Zhipu AI | ~1360 |
| 14 | Mistral Large 3 | Mistral AI | ~1352 |
| 15 | Kimi K2.6 | Moonshot AI | ~1345 |
Source: Presenc AI — Chatbot Arena Elo Leaderboard June 2026
Key observation: The top eight models are clustered within ~55 Elo points — the tightest spread on record. DeepSeek V4.1 Pro is the highest open-weight entry, within ~55 points of the top closed model.
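To put the ~55-point spread in perspective, a rating gap can be converted into an expected head-to-head win rate. The sketch below is illustrative only: it assumes the leaderboard follows the conventional Elo logistic curve with a 400-point scale (the source does not state the scale), and the helper name `expected_win_rate` is made up for this example. The ratings are the approximate values from the table above.

```python
# Approximate Arena Hard ratings for the top eight, copied from the table above.
ARENA_HARD_ELO = {
    "GPT-5.6 Pro": 1465,
    "Claude Mythos 5": 1458,
    "Claude Opus 4.7": 1452,
    "Gemini 3.2 Pro": 1448,
    "GPT-5.6": 1440,
    "Claude Sonnet 4.6": 1428,
    "Gemini 3.2 Flash": 1418,
    "DeepSeek V4.1 Pro": 1410,
}


def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400-point scale is an assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


ratings = sorted(ARENA_HARD_ELO.values(), reverse=True)
spread = ratings[0] - ratings[-1]  # gap between rank 1 and rank 8
top_vs_eighth = expected_win_rate(ratings[0], ratings[-1])

print(f"Top-8 spread: {spread} Elo points")
print(f"Expected win rate, #1 vs #8: {top_vs_eighth:.1%}")
```

Under that assumption, a 55-point gap means the top-ranked model would be preferred in roughly 58% of head-to-head comparisons against the eighth-ranked one, which is a real but modest edge.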
SWE-bench Verified — Coding Leadership
| Model | Score | Notes |
|---|---|---|
| Claude Mythos 5 | 95.5% | New leader — massive +13.1% leap over Opus 4.6 |
| Claude Fable 5 | 95.0% | Public variant nearly matches restricted Mythos |
| Claude Opus 4.8 | 88.6% | Prior top-tier |
| Claude Opus 4.7 | 87.6% | |
| DeepSeek V4 Pro | ~91.2% | Highest open-weight |
| GPT-5.5 | ~93.5% | |
Source: Vellum AI, Source: claudemythos5.vercel.app
GPQA Diamond — Scientific Reasoning
| Rank | Model | Score |
|---|---|---|
| 1 | Claude 3 Opus | 95.4% |
| 2 | Claude Opus 4.7 | 94.2% |
| 3 | Claude Fable 5 | 94.1% |
| 4 | Claude Mythos 5 | 94.1% |
| 5 | Claude Opus 4.8 | 93.6% |
Note: Claude models take all five of the top slots on this benchmark. This suggests Claude's architecture may have a structural advantage on scientific reasoning tasks.
LiveCodeBench & Codeforces — Coding Competitions
| Model | LiveCodeBench | Codeforces | SWE-Bench Pro |
|---|---|---|---|
| DeepSeek V4-Pro | 93.5 | 3206 | |
| MiniMax M3 | | | 59.0% |
| Kimi K2.6 | | | 58.6% |
Source: Kilo AI — Best Open-Source Coding Models 2026
Key finding: DeepSeek V4-Pro leads all evaluated models on LiveCodeBench and Codeforces, including closed frontier APIs. This is a first for an open-weight model.
Additional Benchmark Data
USAMO (Math Olympiad): Claude Mythos 5 — 97.6%, a massive leap from the previous best of ~42%. Source: claudemythos5.vercel.app
Cybench (CTF Challenges): Claude Mythos 5 — 100% perfect score. Source: claudemythos5.vercel.app
Humanity's Last Exam: Claude Mythos 5 — 64.5%, followed by Claude Opus 4.8 at 57.9% and Gemini 3 Pro at 45.8%. Source: Vellum AI
ARC-AGI 2 (Visual Reasoning): GPT-5.5 — 85.0%, followed by Claude Opus 4.6 at 68.8%. Source: Vellum AI
🗣️ Community Feedback
Reddit Discussions
r/LLMDevs: The thread "GPT-5.6 and Claude Mythos/Opus 5 might be closer than expected" (May 18, 2026) highlights a consensus that GPT-5.5 performance is "just as good — and just as far ahead of the trend — if not very [ahead]" of Claude Mythos on some metrics. Community members note that Anthropic may keep Mythos restricted for safety reasons, which could limit its real-world impact despite superior benchmarks. Source: Reddit r/LLMDevs
r/codex: Developers are actively weighing whether to invest in Claude Pro (x5 pricing tier) vs. waiting for GPT-5.6. One prominent comment: "I think GPT 5.6 will be way better than Mythos/Fable and as [accessible]." Source: Reddit r/codex
r/vibecoding: "Claude Mythos sounds more interesting if the cyber and reasoning rumors are accurate, but Anthropic may keep it limited for safety reasons." Source: Reddit r/vibecoding
Hacker News
- 380 pts: Rio de Janeiro's "homegrown" LLM appears to be a merge of an existing model — the definitive autopsy of a "sovereign AI" claim. Source: HN
- 338 pts: Apple Foundation Models platform documentation surfaced. Source: HN
Industry Analysts
Andon Labs tested the unrestricted Mythos 5 model on Vending-Bench (a long-horizon agentic-business evaluation) and reported a more skeptical picture: "On the benchmark it made less money than both Opus 4.7 and GPT-5.5, and its alignment looked like a step back toward older Claude behavior." Source: Vellum AI
Simon Willison (Oxide and Friends podcast, January 2026) predicted that "there are still people out there who are convinced that LLMs cannot write good code. Those people are in for a very nasty shock in 2026." His prediction has proven prescient: Claude Mythos 5 now scores 95.5% on SWE-bench Verified, a benchmark built from real GitHub issues. Source: simonwillison.net
🔍 Noteworthy Analysis
1. The Open-Weight Revolution Is Real
DeepSeek V4-Pro achieving 93.5 on LiveCodeBench and 3206 on Codeforces, ahead of all closed frontier APIs evaluated, is a watershed moment. Combined with MIT licensing, a reported 85% cost reduction vs. GPT-5.5, and self-hosting capability, the economic case for open-weight models has fundamentally shifted. For high-volume coding workflows, the math is clear: DeepSeek V4-Pro at ~$1.10/M output tokens vs. GPT-5.5 at ~$15/M output tokens is roughly a 13.6x price advantage on output tokens for comparable coding performance. Source: MindStudio
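The output-token arithmetic can be checked directly. This is a minimal sketch using only the two list prices quoted in the paragraph above; the helper names are illustrative, and it ignores input tokens, caching discounts, and self-hosting costs.

```python
# Output-token list prices quoted in the paragraph above (USD per million tokens).
GPT_55_OUTPUT_PER_M = 15.00
DEEPSEEK_V4_PRO_OUTPUT_PER_M = 1.10


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per token."""
    return expensive / cheap


def percent_reduction(expensive: float, cheap: float) -> float:
    """Percentage saved by switching from the expensive to the cheap option."""
    return (1.0 - cheap / expensive) * 100.0


ratio = price_ratio(GPT_55_OUTPUT_PER_M, DEEPSEEK_V4_PRO_OUTPUT_PER_M)
reduction = percent_reduction(GPT_55_OUTPUT_PER_M, DEEPSEEK_V4_PRO_OUTPUT_PER_M)

print(f"Output-price ratio: {ratio:.1f}x")          # 13.6x
print(f"Output-price reduction: {reduction:.1f}%")  # 92.7%
```

On output prices alone the gap works out to about 13.6x, or roughly a 93% reduction. The 85% figure cited above presumably rests on a different basis, such as a blended input/output mix, which the source does not spell out.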
2. Chinese Frontier Convergence
A credible four-horse race has emerged in the Chinese frontier: Qwen 3.7, DeepSeek V4.1, Hunyuan Large 3, and GLM-6. This convergence is not coincidental: the clustering of releases reflects a coordinated competitive response to DeepSeek V4's April 2026 benchmark-setting launch. Combined with consumer-surface models from Baidu (ERNIE 5.1) and ByteDance (Doubao Pro), the Chinese AI ecosystem now presents a complete stack from frontier reasoning to consumer applications. Source: Presenc AI
Stanford HAI's 2026 AI Index confirms: "The U.S.-China AI model performance gap has effectively closed. U.S. and Chinese models are now within measurement error on most benchmarks." Source: Stanford HAI
3. Safety-Driven Access Restriction
Claude Mythos 5 tripped Anthropic's ASL-4 safety thresholds, leading to its restricted access under Project Glasswing. With only ~40 vetted organizations able to use it, and a $100 million credit pool, Mythos 5's real-world impact will be measured in vulnerability discoveries and patches, not in API call volume. The White House has been briefed on national security implications. This creates an unusual dynamic: the best-performing model on most benchmarks is the most restricted. Source: claudemythos5.vercel.app
4. The Record-Tight Elo Cluster
The ~55 Elo spread across the top eight models is the tightest spread in Arena history. This means:
- Model selection decisions matter less for general tasks — the top eight will feel comparable in day-to-day use
- Specialized benchmarks (SWE-bench, GPQA, LiveCodeBench) increasingly matter more than general Arena Elo for model selection
- The industry is approaching a performance plateau on general conversational benchmarks
- Future gains will come from specialization (coding, science, security) rather than general intelligence improvements
5. Use-Case Archetype Segmentation
Anthropic's launch of Fable 5 (public) alongside Mythos 5 (restricted), both from the same underlying model, signals that frontier labs are now segmenting by use-case archetype rather than solely by scale tier. This mirrors traditional SaaS market segmentation and suggests we'll see more purpose-built variants (creative, coding, scientific, security) rather than monolithic upgrades. Source: Presenc AI
📋 Model Specifications Comparison
| Model | Parameters | Context | Input $/M | Output $/M | Arena Hard Elo |
|---|---|---|---|---|---|
| GPT-5.6 Pro | Undisclosed | 1M | $30 | $180 | ~1465 |
| Claude Mythos 5 | ~10T (MoE) | 1M | $25 | $125 | ~1458 |
| Claude Fable 5 | ~10T (MoE) | 1M | $10 | $50 | N/A |
| Claude Opus 4.7 | Undisclosed | 1M | $5 | $25 | ~1452 |
| Gemini 3.2 Pro | Undisclosed | 10M | $2 | $12 | ~1448 |
| DeepSeek V4.1 Pro | ~671B (37B active) | 256K | $0.28 | $1.10 | ~1410 |
| Qwen 3.7 | Undisclosed | Undisclosed | Undisclosed | Undisclosed | ~1400 |
| Llama 4.5 Maverick | Undisclosed | 10M | $0.20 | $0.60 | ~1370 |
Sources: Vellum AI, claudemythos5.vercel.app, MindStudio, Presenc AI
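Per-token prices are easier to compare once applied to a concrete workload. The sketch below takes the input/output prices from the table above and applies them to a hypothetical monthly workload of 100M input tokens and 25M output tokens. The workload size, the `Pricing` class, and the `monthly_cost` helper are assumptions made for illustration, not figures from any of the cited sources. Qwen 3.7 is omitted because its pricing is undisclosed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices copied from the specifications table above.
PRICING = {
    "GPT-5.6 Pro": Pricing(30.00, 180.00),
    "Claude Mythos 5": Pricing(25.00, 125.00),
    "Claude Fable 5": Pricing(10.00, 50.00),
    "Claude Opus 4.7": Pricing(5.00, 25.00),
    "Gemini 3.2 Pro": Pricing(2.00, 12.00),
    "DeepSeek V4.1 Pro": Pricing(0.28, 1.10),
    "Llama 4.5 Maverick": Pricing(0.20, 0.60),
}

# Hypothetical workload (an assumption for this example), in millions of tokens.
INPUT_TOKENS_M = 100
OUTPUT_TOKENS_M = 25


def monthly_cost(pricing: Pricing, input_m: float, output_m: float) -> float:
    """Total list-price cost in USD for the given token volumes."""
    return pricing.input_per_m * input_m + pricing.output_per_m * output_m


costs = {
    name: monthly_cost(p, INPUT_TOKENS_M, OUTPUT_TOKENS_M)
    for name, p in PRICING.items()
}

# Print cheapest first.
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:<20} ${cost:>10,.2f}")
```

Changing `INPUT_TOKENS_M` and `OUTPUT_TOKENS_M` to match a real traffic profile shows how much the ranking depends on the input/output mix, since output tokens are priced several times higher than input tokens for every model listed.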
Report compiled: June 15, 2026 | Sources: Presenc AI, Vellum AI, LiveBench, claudemythos5.vercel.app, CORSAIR, MindStudio, Kilo AI, Stanford HAI, Reddit, Hacker News