Benchmark Update — June 15, 2026

📊 Executive Summary

June 2026 has been one of the most competitive months on record for AI model releases. Twelve distinct frontier or near-frontier LLMs shipped or surfaced in the first two weeks alone, and the gap between open-weight and closed models has compressed to its tightest point yet. This report covers the five most notable models and the main findings as of June 15, 2026, based on data from Chatbot Arena, LiveBench, Vellum AI, and independent benchmark evaluations.


🚀 New Model Releases (June 2026)

June's release cadence continues the industry's relentless six-week cycle. The following models arrived this month:

Claude Fable 5 & Claude Mythos 5 — Anthropic

Anthropic introduced a new "Mythos-class" tier sitting above the previous Opus line. Two versions were announced on June 9, 2026 (Anthropic): Claude Fable 5, the public variant, and Claude Mythos 5, the restricted-access variant.

Key capability: Mythos 5 is Anthropic's first model to consistently produce novel, compelling scientific hypotheses. In drug design, it accelerated discovery ~10x, with 9 of 14 protein targets yielding strong candidates now under active investigation. Scientists preferred its hypotheses ~80% of the time over Opus-class models. Source: CORSAIR

GPT-5.6 — OpenAI

GPT-5.6 appeared in OpenAI's Codex backend logs before vanishing, a canary leak that triggered widespread speculation. As of June 15, 2026, it has not been officially announced, but Polymarket traders price the odds of a release by June 30 at 80-89%. Source: DEV Community, Source: WaveSpeed

The model is expected to feature a 1.5M-token context window, improved agentic capabilities, and stronger multimodal support. Source: andrew.ooo

Gemini 3.2 — Google

Gemini 3.2 shipped with a long-context retrieval upgrade, positioned as Google's response to the six-week cadence set by OpenAI and Anthropic. The model now offers a 10M-token context window, the longest in the frontier tier. Source: Vellum AI, Source: Presenc AI

DeepSeek V4.1 — DeepSeek

The V4.1 update delivers a 15% per-token cost reduction over the V4 Flash variant, further improving the economics of open-weight deployment. At ~$0.28/$1.10 per million input/output tokens, it remains roughly 10-13x cheaper than closed frontier models at comparable performance. Source: Presenc AI, Source: MindStudio

Other June Releases

| Model | Lab | Key Feature |
| --- | --- | --- |
| Qwen 3.7 | Alibaba | Undercuts DeepSeek V4 Flash on several configurations |
| Llama 4.5 | Meta | Agentic stability improvements |
| Mistral Medium 3 | Mistral AI | EU multilingual mid-tier refresh |
| Hunyuan Large 3 | Tencent | WeChat integration deepens |
| ERNIE 5.1 | Baidu | Baidu Search overview integration |
| Doubao Pro | ByteDance | Douyin creator-economy emphasis |
| GLM-6 | Zhipu AI | Four-horse Chinese open-weight race |

Source: Presenc AI — June 2026 LLM Release Roundup


🏆 Benchmark Highlights

Chatbot Arena Hard — Top 15 (June 2026)

The Arena Hard leaderboard uses curated, harder prompts that better discriminate between top-tier models. As of June 2026:

| Rank | Model | Vendor | Elo |
| --- | --- | --- | --- |
| 1 | GPT-5.6 Pro | OpenAI | ~1465 |
| 2 | Claude Mythos 5 | Anthropic | ~1458 |
| 3 | Claude Opus 4.7 | Anthropic | ~1452 |
| 4 | Gemini 3.2 Pro | Google | ~1448 |
| 5 | GPT-5.6 | OpenAI | ~1440 |
| 6 | Claude Sonnet 4.6 | Anthropic | ~1428 |
| 7 | Gemini 3.2 Flash | Google | ~1418 |
| 8 | DeepSeek V4.1 Pro | DeepSeek | ~1410 |
| 9 | Qwen 3.7 | Alibaba | ~1400 |
| 10 | GPT-5.6 mini | OpenAI | ~1392 |
| 11 | Grok 4 | xAI | ~1385 |
| 12 | Llama 4.5 Maverick | Meta | ~1370 |
| 13 | GLM-6 | Zhipu AI | ~1360 |
| 14 | Mistral Large 3 | Mistral AI | ~1352 |
| 15 | Kimi K2.6 | Moonshot AI | ~1345 |

Source: Presenc AI — Chatbot Arena Elo Leaderboard June 2026

Key observation: The top eight models are clustered within ~55 Elo points — the tightest spread on record. DeepSeek V4.1 Pro is the highest open-weight entry, within ~55 points of the top closed model.
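To put the ~55-point spread in perspective, an Elo gap can be converted into an expected head-to-head win rate. The sketch below assumes the standard Elo logistic curve (base 10, scale 400) applies to these ratings; the leaderboard's actual fitting procedure may differ, and the helper name `expected_score` is illustrative only.

```python
def expected_score(elo_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo curve.

    ASSUMPTION: base-10 logistic with a 400-point scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Approximate ratings copied from the Arena Hard table above.
ratings = {
    "GPT-5.6 Pro": 1465,
    "Claude Mythos 5": 1458,
    "DeepSeek V4.1 Pro": 1410,
}

top_rating = ratings["GPT-5.6 Pro"]
for model, elo in ratings.items():
    gap = top_rating - elo
    win_rate = expected_score(gap)
    print(f"{model}: gap {gap:>2} Elo, top model expected to win {win_rate:.1%}")
```

Under this assumption, a 55-point gap corresponds to the top model being preferred in roughly 58% of pairwise comparisons, which is a modest edge rather than a decisive one.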

SWE-bench Verified — Coding Leadership

| Model | Score | Notes |
| --- | --- | --- |
| Claude Mythos 5 | 95.5% | New leader; +13.1-point leap over Opus 4.6 |
| Claude Fable 5 | 95.0% | Public variant nearly matches restricted Mythos |
| GPT-5.5 | ~93.5% | |
| DeepSeek V4 Pro | ~91.2% | Highest open-weight |
| Claude Opus 4.8 | 88.6% | Prior top Anthropic tier |
| Claude Opus 4.7 | 87.6% | |

Source: Vellum AI, Source: claudemythos5.vercel.app
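The +13.1 figure for Mythos 5 over Opus 4.6 reads most naturally as percentage points rather than a relative change. The sketch below shows how different the two readings are. The Opus 4.6 score is not listed in the table, so the value used here is an assumption derived by subtracting 13.1 from 95.5, and the function names are illustrative.

```python
def point_gain(new_pct: float, old_pct: float) -> float:
    """Absolute change, in percentage points."""
    return new_pct - old_pct


def relative_gain(new_pct: float, old_pct: float) -> float:
    """Change relative to the old score, in percent."""
    return (new_pct - old_pct) / old_pct * 100.0


def failure_reduction(new_pct: float, old_pct: float) -> float:
    """Percent drop in the share of unresolved tasks."""
    old_fail = 100.0 - old_pct
    new_fail = 100.0 - new_pct
    return (old_fail - new_fail) / old_fail * 100.0


mythos_5 = 95.5
# ASSUMPTION: implied Opus 4.6 score if "+13.1" means percentage points.
implied_opus_46 = mythos_5 - 13.1

print(f"Point gain:        {point_gain(mythos_5, implied_opus_46):.1f} points")
print(f"Relative gain:     {relative_gain(mythos_5, implied_opus_46):.1f}%")
print(f"Failure reduction: {failure_reduction(mythos_5, implied_opus_46):.1f}%")
```

Near the top of a saturating benchmark, the reduction in unresolved tasks is often the more informative number, since a few points of headline score can hide a large drop in failures.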

GPQA Diamond — Scientific Reasoning

| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude 3 Opus | 95.4% |
| 2 | Claude Opus 4.7 | 94.2% |
| 3 | Claude Fable 5 | 94.1% |
| 4 | Claude Mythos 5 | 94.1% |
| 5 | Claude Opus 4.8 | 93.6% |

Source: Vellum AI

Note: Claude models hold all five of the top slots on this benchmark. This suggests the Claude family may have a structural advantage on scientific reasoning tasks.

LiveCodeBench & Codeforces — Coding Competitions

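| Model | LiveCodeBench | Codeforces | SWE-Bench Pro |
| --- | --- | --- | --- |
| DeepSeek V4-Pro | 93.5 | 3206 | not listed |
| MiniMax M3 | not listed | not listed | 59.0% |
| Kimi K2.6 | not listed | not listed | 58.6% |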

Source: Kilo AI — Best Open-Source Coding Models 2026

Key finding: DeepSeek V4-Pro leads all evaluated models on LiveCodeBench and Codeforces, including closed frontier APIs. This is a first for an open-weight model.

Additional Benchmark Data

USAMO (Math Olympiad): Claude Mythos 5 — 97.6%, a massive leap from the previous best of ~42%. Source: claudemythos5.vercel.app

Cybench (CTF Challenges): Claude Mythos 5 scored 100%, a perfect score. Source: claudemythos5.vercel.app

Humanity's Last Exam: Claude Mythos 5 — 64.5%, followed by Claude Opus 4.8 at 57.9% and Gemini 3 Pro at 45.8%. Source: Vellum AI

ARC-AGI 2 (Visual Reasoning): GPT-5.5 — 85.0%, followed by Claude Opus 4.6 at 68.8%. Source: Vellum AI


🗣️ Community Feedback

Reddit Discussions

r/LLMDevs: The thread "GPT-5.6 and Claude Mythos/Opus 5 might be closer than expected" (May 18, 2026) highlights a consensus that GPT-5.5 performance is "just as good — and just as far ahead of the trend — if not very [ahead]" of Claude Mythos on some metrics. Community members note that Anthropic may keep Mythos restricted for safety reasons, which could limit its real-world impact despite superior benchmarks. Source: Reddit r/LLMDevs

r/codex: Developers are actively weighing whether to invest in Claude Pro (x5 pricing tier) vs. waiting for GPT-5.6. One prominent comment: "I think GPT 5.6 will be way better than Mythos/Fable and as [accessible]." Source: Reddit r/codex

r/vibecoding: "Claude Mythos sounds more interesting if the cyber and reasoning rumors are accurate, but Anthropic may keep it limited for safety reasons." Source: Reddit r/vibecoding

Hacker News

Industry Analysts

Andon Labs tested the unblocked Mythos 5 model on Vending-Bench (a long-horizon agentic-business evaluation) and reported a more skeptical picture: "On the benchmark it made less money than both Opus 4.7 and GPT-5.5, and its alignment looked like a step back toward older Claude behavior." Source: Vellum AI

Simon Willison (Oxide and Friends podcast, January 2026) predicted that "there are still people out there who are convinced that LLMs cannot write good code. Those people are in for a very nasty shock in 2026." The prediction has held up: Claude Mythos 5 now scores 95.5% on SWE-bench Verified, resolving the large majority of the benchmark's real GitHub issues. Source: simonwillison.net


🔍 Analysis Worth Noting

1. The Open-Weight Revolution Is Real

DeepSeek V4-Pro scoring 93.5 on LiveCodeBench and 3206 on Codeforces, ahead of all closed frontier APIs, is a watershed moment. Combined with MIT licensing, a reported 85% cost reduction vs. GPT-5.5, and self-hosting capability, this has fundamentally shifted the economic case for open-weight models. For high-volume coding workflows, the math is clear: DeepSeek V4-Pro at ~$1.10/M output tokens vs. GPT-5.5 at ~$15/M output tokens is a roughly 13.6x cost advantage on output tokens for comparable coding performance. Source: MindStudio
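The output-token comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using only the two approximate output prices quoted in the paragraph; the function names are illustrative. On output price alone the reduction comes out near 93%, so the 85% figure is presumably measured on a different basis, such as a blended input/output mix. That explanation is an assumption, not something the source states.

```python
def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per token."""
    return expensive / cheap


def percent_reduction(expensive: float, cheap: float) -> float:
    """Percent saved by switching from the expensive to the cheap option."""
    return (1.0 - cheap / expensive) * 100.0


# Approximate output prices in USD per million tokens, as quoted above.
gpt_55_output = 15.00
deepseek_v4_pro_output = 1.10

ratio = cost_ratio(gpt_55_output, deepseek_v4_pro_output)
saving = percent_reduction(gpt_55_output, deepseek_v4_pro_output)

print(f"Cost ratio on output tokens: {ratio:.1f}x")
print(f"Reduction on output tokens:  {saving:.1f}%")
```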

2. Chinese Frontier Convergence

A credible four-horse race has emerged at the Chinese frontier: Qwen 3.7, DeepSeek V4.1, Hunyuan Large 3, and GLM-6. This convergence is not coincidental: the clustering of releases reflects a coordinated competitive response to DeepSeek V4's April 2026 benchmark-setting launch. Combined with consumer-surface models from Baidu (ERNIE 5.1) and ByteDance (Doubao Pro), the Chinese AI ecosystem now presents a complete stack from frontier reasoning to consumer applications. Source: Presenc AI

Stanford HAI's 2026 AI Index confirms: "The U.S.-China AI model performance gap has effectively closed. U.S. and Chinese models are now within measurement error on most benchmarks." Source: Stanford HAI

3. Safety-Driven Access Restriction

Claude Mythos 5 tripped Anthropic's ASL-4 safety thresholds, leading to its restricted access under Project Glasswing. With only ~40 vetted organizations able to use it, and a $100 million credit pool, Mythos 5's real-world impact will be measured in vulnerability discoveries and patches, not in API call volume. The White House has been briefed on national security implications. This creates an unusual dynamic: the best-performing model on most benchmarks is the most restricted. Source: claudemythos5.vercel.app

4. The Record-Tight Elo Cluster

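The ~55 Elo spread across the top eight models is the tightest in Arena history. In practice, this means the highest open-weight entry, DeepSeek V4.1 Pro, sits within ~55 points of the top closed model, GPT-5.6 Pro.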

5. Use-Case Archetype Segmentation

Anthropic's launch of Fable 5 (public) alongside Mythos 5 (restricted), both from the same underlying model, signals that frontier labs are now segmenting by use-case archetype rather than solely by scale tier. This mirrors traditional SaaS market segmentation and suggests we'll see more purpose-built variants (creative, coding, scientific, security) rather than monolithic upgrades. Source: Presenc AI


📋 Model Specifications Comparison

| Model | Parameters | Context (tokens) | Input ($/M tokens) | Output ($/M tokens) | Arena Hard Elo |
| --- | --- | --- | --- | --- | --- |
| GPT-5.6 Pro | Undisclosed | 1M | $30 | $180 | ~1465 |
| Claude Mythos 5 | ~10T (MoE) | 1M | $25 | $125 | ~1458 |
| Claude Fable 5 | ~10T (MoE) | 1M | $10 | $50 | N/A |
| Claude Opus 4.7 | Undisclosed | 1M | $5 | $25 | ~1452 |
| Gemini 3.2 Pro | Undisclosed | 10M | $2 | $12 | ~1448 |
| DeepSeek V4.1 Pro | ~671B (37B active) | 256K | $0.28 | $1.10 | ~1410 |
| Qwen 3.7 | Undisclosed | Undisclosed | Undisclosed | Undisclosed | ~1400 |
| Llama 4.5 Maverick | Undisclosed | 10M | $0.20 | $0.60 | ~1370 |

Sources: Vellum AI, claudemythos5.vercel.app, MindStudio, Presenc AI
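Per-token prices are easier to compare once applied to a concrete workload. The sketch below estimates monthly spend for each priced model in the table. The workload of 50M input and 10M output tokens is a hypothetical example, not a figure from any source, and the `Pricing` class and `workload_cost` helper are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices copied from the specifications table; Qwen 3.7 is omitted
# because its pricing is undisclosed.
PRICES = {
    "GPT-5.6 Pro": Pricing(30.0, 180.0),
    "Claude Mythos 5": Pricing(25.0, 125.0),
    "Claude Fable 5": Pricing(10.0, 50.0),
    "Claude Opus 4.7": Pricing(5.0, 25.0),
    "Gemini 3.2 Pro": Pricing(2.0, 12.0),
    "DeepSeek V4.1 Pro": Pricing(0.28, 1.10),
    "Llama 4.5 Maverick": Pricing(0.20, 0.60),
}


def workload_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for the given token counts."""
    return (
        input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    ) / 1_000_000


# HYPOTHETICAL monthly workload, chosen only for illustration.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 10_000_000

by_cost = sorted(
    PRICES.items(),
    key=lambda item: workload_cost(item[1], INPUT_TOKENS, OUTPUT_TOKENS),
)
for model, pricing in by_cost:
    cost = workload_cost(pricing, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<20} ${cost:>10,.2f}")
```

Changing the input/output ratio shifts the ranking less than it shifts the absolute gaps, because output tokens are priced several times higher than input tokens for every model listed.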


Report compiled: June 15, 2026 | Sources: Presenc AI, Vellum AI, LiveBench, claudemythos5.vercel.app, CORSAIR, MindStudio, Kilo AI, Stanford HAI, Reddit, Hacker News
