Model Intelligence — 2026-06-15
🔥 Top Stories
1. vLLM v0.23.0 — DeepSeek-V4 Gets Production Green Light
vLLM v0.23.0 dropped today with 408 commits from 200 contributors. The headline is the TRTLLM-gen attention kernel for DeepSeek-V4, making it genuinely production-ready.
Key additions:
- DeepSeek-V4 hardening: sparse MLA metadata decoupled from V3.2, EPLB support for Mega-MoE, XPU attention decode path, selective prefix-cache retention for sliding-window KV cache
- Model Runner V2 now default for Llama and Mistral dense models (plus Qwen3), with FlashInfer sampler and breakable CUDA graphs
- Rust frontend grew streaming
generate, dynamic LoRA endpoints, request-ID headers - Gemma 4 Unified encoder-free support + MTP
- Transformers v5 compatibility
Signal: The TRTLLM kernel is the missing piece for production DeepSeek-V4 serving. Between vLLM v0.23.0 and SGLang v0.5.13's full V4 inference path, the ecosystem now has two battle-tested options. Minimax M3 not yet supported — expect v0.24.
2. llama.cpp Hits b9660 — Chat & Tool Calling Hardened
llama.cpp reached b9660 today (from b9637 yesterday), spanning 13+ commits. Five builds landed today alone:
- b9655 — Grammar generator bug fix that surfaced during recent changes (#24653)
- b9656 — Hardened peg-native tool call parsing with optional leading type support (#24329)
- b9658 — Full unparsed prompt included in debug output (#24650)
- b9659 — Continued hardening
- b9660 — Latest build of the sprint
Also in the day's span: Metal bf16 repeat (Mac Studio M-series), SYCL Level Zero optimization (Intel GPUs), WebGPU i-quants performance, HEIC/HEIF image support, thinking/reasoning block rendering as markdown (#24611).
Signal: The tool call parsing hardening is quietly critical — if you're running function-calling models through llama.cpp, b9656+ is the update to grab. The reasoning-block markdown rendering matters for DeepSeek-R1 and chain-of-thought workflows.
3. Ollama v0.30.9-rc1 — Rolling Up to b9637
Ollama v0.30.9-rc1 landed today, updating the bundled llama.cpp to b9637. This means the Cohere2MoE parser and recent template fixes from yesterday's llama.cpp sprint are now in the Ollama pipeline.
Signal: RC releases typically ship within a week. If you're on Ollama for production, the v0.30.8→v0.30.9 upgrade brings 2+ days of llama.cpp fixes into a stable package.
📊 Model Trends
HuggingFace Top 15
| Rank | Model | Likes | Downloads |
|---|---|---|---|
| 1 | DeepSeek-R1 | 13,394 | 4.5M |
| 2 | FLUX.1-dev | 13,208 | 587K |
| 3 | SDXL 1.0 | 7,819 | 1.0M |
| 4 | SD 1.4 | 7,020 | 304K |
| 5 | Llama-3-8B | 6,579 | 896K |
| 6 | Kokoro-82M | 6,335 | 11.7M 🔥 |
| 7 | Llama-3.1-8B-Instruct | 6,086 | 6.6M |
| 8 | Whisper-large-v3 | 5,824 | 4.1M |
| 9 | FLUX.1-schnell | 5,130 | 220K |
| 10 | bloom | 5,011 | 3.3K |
| 11 | SD3-medium | 4,975 | 2.0K |
| 12 | MiniLM-L6-v2 | 4,950 | 167M |
| 13 | gpt-oss-120b | 4,888 | 2.9M |
| 14 | DeepSeek-V4-Pro | 4,867 | 2.9M |
| 15 | Z-Image-Turbo | 4,808 | 637K |
Standouts: Kokoro-82M jumped +600K downloads to 11.7M — this is embedded-in-production velocity, not hobbyist testing. DeepSeek-V4-Pro climbing steadily at 4,857 likes. MiniLM-L6-v2 at 167M downloads remains the silent workhorse of embedding.
Qwen Family
| Model | Likes | Notes |
|---|---|---|
| QwQ-32B | 2,931 | Top Qwen |
| Qwen3.5-27B Claude-4.6 distill | 2,881 | Community reasoning distill |
| Qwen-Image | 2,511 | Text-to-image |
| Qwen-Image-Edit | 2,424 | Image editing |
| Qwen3.6-35B-A3B | 2,121 | MoE value king, 3.3M downloads |
Qwen3.6-35B-A3B remains the practical recommendation: MoE with 3B active params out of 35B total, runs on consumer hardware.
Gemma Family
| Model | Likes | Downloads |
|---|---|---|
| gemma-7b | 3,357 | 24K |
| gemma-4-31B-it | 2,992 | 7.5M |
| gemma-3-27b-it | 1,979 | 971K |
| gemma-3-4b-it | 1,368 | 1.1M |
gemma-4-31B-it at 7.5M downloads — Google's flagship with solid local inference support via Ollama's QAT weights.
⚙️ Engine Updates
| Engine | Version | Date | Status |
|---|---|---|---|
| llama.cpp | b9660 | Jun 15 | ⬆️ +5 today |
| Ollama | v0.30.9-rc1 | Jun 15 | ⬆️ NEW RC |
| vLLM | v0.23.0 | Jun 15 | ⬆️ RELEASED TODAY |
| SGLang | v0.5.13 | Jun 13 | — |
Three engines shipped today — the busiest single day in recent weeks. If you haven't updated any inference stack in the last 48 hours, do it now.
📰 AI News (Hacker News)
- [514 pts] CrankGPT — crankgpt.com — New AI tool with massive HN engagement
- [449 pts] Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding? — HN — The question everyone's asking
- [100 pts] Can Europe train a frontier AI model on the compute it owns? — euromesh — European AI sovereignty exploration
The coding replacement thread (449 pts) is the most useful discussion — real practitioners sharing what works for local-first dev workflows.
🔄 What Changed Since Yesterday
| Area | Yesterday (Jun 14) | Today (Jun 15) | Change |
|---|---|---|---|
| llama.cpp | b9637 | b9660 | +5 builds |
| Ollama | v0.30.8 | v0.30.9-rc1 | +1 RC |
| vLLM | v0.23.0 | v0.23.0 | — (same release) |
| SGLang | v0.5.13 | v0.5.13 | — |
| Kokoro-82M downloads | 11.1M | 11.7M | +600K 🔥 |
| DeepSeek-V4-Pro likes | 4,824 | 4,867 | +43 |
| DeepSeek-R1 likes | 13,390 | 13,394 | +4 |
| FLUX.1-dev likes | 13,194 | 13,208 | +14 |
| Llama-3.1-8B-Instruct likes | 6,075 | 6,086 | +11 |
| gpt-oss-120b likes | 4,883 | 4,888 | +5 |
| QwQ-32B likes | 2,930 | 2,931 | +1 |
| gemma-4-31B-it likes | 2,979 | 2,992 | +13 |
Local Inference Recommendations
RTX 3060 (12GB):
- Qwen3.6-35B-A3B at Q4_K_M — MoE efficiency still unbeatable at 3B active params
- Gemma-4-31B-it at Q4 — 7.5M downloads validate the quality
- Kokoro-82M for TTS — production-grade, runs in milliseconds
RTX 3090 (24GB):
- DeepSeek-V4-Pro via vLLM v0.23.0 — TRTLLM kernel makes production serving practical now
- Qwen3.6-35B-A3B at Q6/Q8 — full quality with context headroom
- Gemma-4-31B-it at Q6 — high quality, extended conversations
- gpt-oss-120b at Q2_K — experimental but functional on 24GB
Key takeaway: Three engines shipping today makes this a must-update day. vLLM v0.23.0's TRTLLM kernel is the most actionable release — it unlocks production DeepSeek-V4 serving. llama.cpp's tool call parsing hardening (b9656) matters for function-calling workflows. Ollama's RC brings the day's llama.cpp fixes into a stable packaging pipeline.
Scan completed: 2026-06-15 | Sources: HuggingFace API, llama.cpp GitHub, Ollama GitHub, vLLM GitHub, SGLang GitHub, Hacker News