Model Intelligence — 2026-06-13
🔥 Top Stories
1. vLLM v0.23.0: DeepSeek-V4 Hardening, Model Runner V2 Goes Wider
vLLM v0.23.0 shipped on June 12 with 408 commits from 200 contributors (63 new). This is a substantial release that matters for anyone running local or edge inference.
DeepSeek-V4 gets serious production treatment. Following its introduction in v0.22.0, the V4 architecture received a comprehensive optimization pass: sparse MLA metadata decoupled from V3.2, a new TRTLLM-gen attention kernel, EPLB support for the Mega-MoE, selective prefix-cache retention for sliding-window KV cache, and an index-share feature for DSA MTP. The model was also detached from torch.compile, and its attention/RoPE paths were refactored for better performance. An XPU attention decode path was added — meaning Intel Arc GPUs get a path to run V4 efficiently.
For local inference users: DeepSeek-V4-Pro now has 3.25 million downloads on HuggingFace and has entered the likes ranking at #14 (4,813 likes). Quantized V4 variants small enough for a 24GB GPU should benefit directly from vLLM's optimized paths. The TRTLLM-gen attention kernel runs inside vLLM itself, so you get TensorRT-LLM-style attention kernels without switching serving stacks.
Model Runner V2 expands to dense models. MRv2 is now the default for Llama and Mistral dense models in addition to Qwen3. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, and Gemma 4 MTP support. This means faster throughput and lower latency across the most commonly deployed model families. If you're running Llama 3 8B on an RTX 3060 or 3090, this update gives you a free performance boost — just update your vLLM installation.
Rust frontend grows up. The experimental Rust frontend added streaming generate, dynamic LoRA endpoints, /version, and /server_info endpoints. This is still experimental but moving toward production readiness, which could mean lower-latency API serving for self-hosted deployments.
2. SGLang v0.5.13: Nemotron 3 Ultra + 7 Diffusion Models in One Release
SGLang v0.5.13 dropped June 13 with an enormous model support expansion.
Autoregressive additions: NVIDIA's Nemotron 3 Ultra got day-0 support, along with Step-3.7-Flash and Command A+. Nemotron 3 Ultra is NVIDIA's flagship reasoning model — having immediate SGLang support means you can run it through a highly optimized inference engine rather than waiting for framework adoption.
Diffusion explosion: SGLang now supports 7 diffusion models: Cosmos3, LingBot-World, SANA-WM, Ernie-Image, FLUX.2-Klein 4B/9B, and Ideogram 4. This is significant because SGLang's serving architecture (radix attention, continuous batching, speculative decoding) gives diffusion models the same performance benefits that autoregressive models have enjoyed. FLUX.2-Klein at 4B would run on a 12GB card with aggressive quantization — that's RTX 3060 territory for state-of-the-art image generation through a unified inference engine.
Spec V2 goes default. Tree drafting with topk > 1 is now the production default across Triton, FA3, MLA, and aiter backends, including page_size > 1 and Mamba/hybrid-linear models. This means faster generation across all supported models without configuration changes.
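To make the benefit of topk > 1 concrete, here is a toy sketch of tree drafting with greedy verification. It is not SGLang's implementation: `target_next`, `draft_topk`, and the arithmetic "models" behind them are hypothetical stand-ins, and a real engine verifies the whole tree in one batched forward pass rather than token by token. The toy draft model is deliberately built so that its first guess is always wrong and its second guess is always right.

```python
from typing import Dict, List

VOCAB = 10  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Hypothetical target model: deterministic next token."""
    return (prefix[-1] * 3 + 1) % VOCAB


def draft_topk(prefix: List[int], k: int) -> List[int]:
    """Hypothetical draft model: rank 1 is off by one, rank 2 is correct."""
    base = prefix[-1] * 3
    return [(base + 2 - i) % VOCAB for i in range(k)]


def build_tree(prefix: List[int], k: int, depth: int) -> Dict[int, dict]:
    """Expand k draft candidates per node, `depth` levels deep."""
    if depth == 0:
        return {}
    return {
        tok: build_tree(prefix + [tok], k, depth - 1)
        for tok in draft_topk(prefix, k)
    }


def verify(prefix: List[int], tree: Dict[int, dict]) -> List[int]:
    """Walk the tree along the target's choices.

    Every emitted token is the target's own choice, so output matches
    plain greedy decoding. Drafted tokens only decide how far we get
    in one verification step.
    """
    emitted: List[int] = []
    node = tree
    while True:
        want = target_next(prefix + emitted)
        emitted.append(want)
        if want not in node:  # draft missed (or leaf reached): stop here
            break
        node = node[want]
    return emitted


if __name__ == "__main__":
    prompt = [1]
    print(verify(prompt, build_tree(prompt, k=1, depth=3)))  # chain draft
    print(verify(prompt, build_tree(prompt, k=2, depth=3)))  # tree draft
```

With a single-chain draft (k=1) the first miss ends the step after one token, while the k=2 tree keeps a second candidate alive at each level and gets through all three drafted levels plus the target's bonus token.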
3. Ollama v0.30.8: Better Prompt Caching and MLX Stability
Ollama v0.30.8 shipped June 12 with focused improvements that hit the sweet spot for desktop/local inference users.
Prompt caching decoupled from context shift for better KV cache reuse. This is a practical speedup — repeated prompts (system messages, repeated instructions) get served from cache instead of re-computed. If you're running chat assistants with long system prompts, this directly cuts latency.
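The idea behind prompt caching can be sketched in a few lines. This is a toy model only, not Ollama's implementation: the `PrefixCache` class and its methods are hypothetical, and a set of token tuples stands in for real KV cache state.

```python
from typing import List, Set, Tuple


class PrefixCache:
    """Toy prompt cache: remembers which token prefixes were already processed."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Length of the longest prefix of `tokens` already in the cache."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._seen:
                return n
        return 0

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that had to be computed)."""
        reused = self.longest_cached_prefix(tokens)
        # Only the uncached suffix is "computed"; record each new prefix.
        for n in range(reused + 1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:n]))
        return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = [1, 2, 3, 4]                # shared system message
    print(cache.process(system_prompt + [10, 11]))  # first request: nothing cached
    print(cache.process(system_prompt + [20]))      # second request reuses the prefix
```

The second request only pays for the tokens after the shared system prompt, which is where the latency saving for long system messages comes from.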
MLX inference hardened with stable linear and embedding layers, plus snapshot creation during prompt processing and speculative decoding. Apple Silicon users get more reliable inference.
Recurrent model support improved with per-boundary states from gated-delta kernels. This opens the door for running RWKV and other RNN-style models through Ollama with proper state management.
📊 Model Trends
HuggingFace Top 15 (by likes)
| Rank | Model | Likes | Trend |
|---|---|---|---|
| 1 | deepseek-ai/DeepSeek-R1 | 13,389 | +14 |
| 2 | black-forest-labs/FLUX.1-dev | 13,188 | +103 ⬆️ |
| 3 | stabilityai/stable-diffusion-xl-base-1.0 | 7,814 | — |
| 4 | CompVis/stable-diffusion-v1-4 | 7,020 | — |
| 5 | meta-llama/Meta-Llama-3-8B | 6,574 | — |
| 6 | hexgrad/Kokoro-82M | 6,319 | — |
| 7 | meta-llama/Llama-3.1-8B-Instruct | 6,072 | — |
| 8 | openai/whisper-large-v3 | 5,812 | — |
| 9 | black-forest-labs/FLUX.1-schnell | 5,110 | — |
| 10 | bigscience/bloom | 5,011 | — |
| 11 | stabilityai/stable-diffusion-3-medium | 4,972 | — |
| 12 | sentence-transformers/all-MiniLM-L6-v2 | 4,941 | — |
| 13 | openai/gpt-oss-120b | 4,880 | +23 |
| 14 | deepseek-ai/DeepSeek-V4-Pro | 4,813 | NEW ⬆️ |
| 15 | Tongyi-MAI/Z-Image-Turbo | 4,800 | NEW ⬆️ |
Notable movements: FLUX.1-dev gained 103 likes in 6 days, the fastest climb in the top 15. DeepSeek-V4-Pro and Z-Image-Turbo are new top-15 entrants. gpt-oss-120b continues steady growth at 2.81M downloads.
Qwen Family
| Model | Likes | Trend | Notes |
|---|---|---|---|
| Qwen/QwQ-32B | 2,930 | — | Top Qwen model |
| Jackrong/Qwen3.5-27B-Claude-4.6 | 2,878 | +5 | Community distillation |
| Qwen/Qwen-Image | 2,511 | NEW | Text-to-image model |
| Qwen/Qwen-Image-Edit | 2,424 | NEW | Image editing |
| Phr00t/Qwen-Image-Edit-Rapid-AIO | 2,146 | — | Community port |
| Qwen/Qwen3.6-35B-A3B | 2,098 | +66 ⬆️ | MoE, 3.5M downloads |
| Qwen/Qwen2.5-Coder-32B-Instruct | 2,041 | — | |
| Qwen/Qwen2.5-Omni-7B | 1,905 | — | |
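Qwen3.6-35B-A3B is the story to watch. The MoE architecture activates only about 3B parameters per token, so this 35B model generates at speeds closer to a small dense model. All 35B weights still have to be resident somewhere, though: at Q4_K_M (roughly 4.8 bits per weight) that is about 20GB, which exceeds an RTX 3060's 12GB. On a 12GB card the practical setup is to keep the attention layers on the GPU and offload expert weights to system RAM, which stays usable because so few parameters are active per token. On an RTX 3090 (24GB), Q4_K_M fits entirely in VRAM with a few GB left for context, while Q6 and Q8 (roughly 27GB and 35GB) need partial offload. The 3.5M download count shows strong adoption.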
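A back-of-the-envelope way to size quantized weights is total parameters times bits per weight. The sketch below assumes a nominal 35B total parameter count (taken from the model name) and approximate bits-per-weight averages for GGUF quantization types; the `reserve_gib` allowance for KV cache and runtime buffers is an arbitrary illustrative value, and the helper names are hypothetical.

```python
# Approximate average bits per weight for common GGUF quantization types.
# Q4_K_M mixes block formats, so its figure is an average, not an exact value.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5, "F16": 16.0}


def weight_gib(total_params_billion: float, quant: str) -> float:
    """Size of ALL weights in GiB. MoE models still store every expert."""
    bits = BITS_PER_WEIGHT[quant]
    return total_params_billion * 1e9 * bits / 8 / 2**30


def fits(total_params_billion: float, quant: str, vram_gib: float,
         reserve_gib: float = 2.0) -> bool:
    """True if the weights plus a fixed reserve fit in VRAM without offload."""
    return weight_gib(total_params_billion, quant) + reserve_gib <= vram_gib


if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q6_K", "Q8_0"):
        size = weight_gib(35, quant)
        print(f"35B @ {quant}: {size:5.1f} GiB | "
              f"12GB card: {fits(35, quant, 12)} | 24GB card: {fits(35, quant, 24)}")
```

The active parameter count does not appear in this calculation at all: it governs tokens per second, not how much memory the weights occupy.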
Gemma Family
| Model | Likes | Trend | Notes |
|---|---|---|---|
| google/gemma-7b | 3,354 | — | Classic |
| google/gemma-4-31B-it | 2,975 | +49 | Latest flagship |
| google/gemma-3-27b-it | 1,976 | — | |
| google/gemma-3n-E4B-it-litert-preview | 1,484 | — | Edge-ready |
| google/gemma-2-2b-it | 1,387 | — | Ultra-light |
| google/gemma-3-4b-it | 1,365 | — | |
| google/gemma-7b-it | 1,247 | — | |
| google/gemma-4-E4B-it | 1,241 | — | |
| google/gemma-2b | 1,193 | — | |
| google/gemma-4-26B-A4B-it | 1,132 | NEW | 8.3M downloads |
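Gemma 4 26B-A4B-it is the sleeper hit here. With 8.3 million downloads (the highest in the Gemma family), this MoE variant with 4B active parameters is clearly resonating. At Q4 the full 26B weights come to roughly 15GB, so an RTX 3060 needs expert weights offloaded to system RAM, which the small active set keeps tolerable. On an RTX 3090, Q4 fits fully in VRAM with context headroom and Q6 (roughly 20GB) is a tight fit; Q8 (roughly 26GB) does not fit without offload.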
⚙️ Engine Updates
llama.cpp — b9628 (June 14)
- 77 new builds since last scan (b9551 → b9628)
- Added SYCL support to release pipeline (#24583)
- Continues aggressive daily release cadence
- Changelog
Ollama — v0.30.8 (June 12)
- Fixed `ollama launch` provider selection bug
- Prompt caching decoupled from context shift for better KV cache reuse
- Hardened MLX inference (linear/embedding layers, snapshots)
- Improved recurrent model support (gated-delta kernels)
- Changelog
vLLM — v0.23.0 (June 12)
- DeepSeek-V4 hardening: TRTLLM kernel, EPLB, prefix-cache retention, XPU support
- Model Runner V2 default for Llama + Mistral dense models
- FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination
- Rust frontend: streaming generate, dynamic LoRA endpoints
- 408 commits from 200 contributors
- Changelog
SGLang — v0.5.13 (June 13)
- Nemotron 3 Ultra day-0 support
- 7 diffusion models: Cosmos3, LingBot-World, SANA-WM, Ernie-Image, FLUX.2-Klein 4B/9B, Ideogram 4
- Spec V2 (tree drafting, topk > 1) now production default
- Step-3.7-Flash, Command A+ autoregressive support
- Changelog
📰 AI News (HN)
Amazon CEO's Talks with U.S. Officials Triggered Crackdown on Anthropic Models
- Wall Street Journal
- Score: 574 on Hacker News
- Government pressure on AI model deployment continues to shape cloud provider policies
🔄 What Changed Since the Last Scan
Since the last scan on 2026-06-07, there have been significant changes:
| Area | Previous | Current | Change |
|---|---|---|---|
| llama.cpp | b9551 | b9628 | +77 builds |
| Ollama | v0.30.6 | v0.30.8 | +2 versions |
| vLLM | v0.22.1 | v0.23.0 | Minor version bump |
| SGLang | v0.5.12.post1 | v0.5.13 | +1 version |
| DeepSeek-V4-Pro | Not tracked | 4,813 likes | New top-15 model |
| Qwen3.6-35B-A3B | 2,032 likes | 2,098 likes | +66 |
| FLUX.1-dev | 13,085 likes | 13,188 likes | +103 |
| gemma-4-26B-A4B-it | Not tracked | 1,132 likes | New, 8.3M downloads |
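The "Change" column for like counts is a plain difference between two scan snapshots. A minimal sketch of that comparison, using figures from the table above; the `diff_scans` helper and the dictionary layout are hypothetical, not the scanner's actual code.

```python
from typing import Dict


def diff_scans(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, str]:
    """Label each currently tracked model with its like delta, or 'new'."""
    changes: Dict[str, str] = {}
    for name, likes_now in current.items():
        likes_before = previous.get(name)
        if likes_before is None:
            changes[name] = "new"  # not tracked in the previous scan
        else:
            changes[name] = f"{likes_now - likes_before:+d}"
    return changes


if __name__ == "__main__":
    scan_0607 = {"FLUX.1-dev": 13085, "Qwen3.6-35B-A3B": 2032}
    scan_0613 = {"FLUX.1-dev": 13188, "Qwen3.6-35B-A3B": 2098,
                 "DeepSeek-V4-Pro": 4813}
    for model, change in diff_scans(scan_0607, scan_0613).items():
        print(f"{model}: {change}")
```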
Local Inference Recommendations (Updated)
RTX 3060 (12GB):
- Qwen3.6-35B-A3B at Q4_K_M with expert weights offloaded to system RAM; the 3B active params keep this surprisingly viable
- Gemma-4-26B-A4B-it at Q4 with expert offload; 4B active params, similar story
- Llama 3.1 8B at Q8 — still the gold standard for 12GB cards
- Gemma 3n-E4B — purpose-built for edge, runs comfortably
RTX 3090 (24GB):
- Qwen3.6-35B-A3B at Q4_K_M: fits fully in VRAM (about 20GB) with room for context; Q6/Q8 need partial offload
- Gemma-4-26B-A4B-it at Q4 or Q6: fits in VRAM; Q8 (about 26GB) needs offload
- DeepSeek-V4-Pro quantized — vLLM's new kernels make this practical
- Llama 3.1 8B at FP16 — still fits with generous context headroom
- Gemma-4-31B-it at Q4 — the 31B flagship, quantized
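Context headroom can be estimated from the KV cache size per token. The sketch below uses Llama 3.1 8B as the worked case, assuming its published configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, about 8.03B parameters) and an FP16 KV cache; the function names are hypothetical, and real runtimes add buffers and fragmentation on top.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for a single sequence of `context_tokens`."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 2**30


# Assumed parameter count for Llama 3.1 8B, stored at 2 bytes each (FP16).
LLAMA31_8B_PARAMS = 8.03e9
weights_gib = LLAMA31_8B_PARAMS * 2 / 2**30

if __name__ == "__main__":
    for ctx in (8192, 32768):
        total = weights_gib + kv_cache_gib(ctx)
        print(f"context {ctx:6d}: weights {weights_gib:.1f} GiB "
              f"+ KV {kv_cache_gib(ctx):.1f} GiB = {total:.1f} GiB")
```

Under these assumptions a 32K-token context adds 4 GiB of KV cache on top of roughly 15 GiB of FP16 weights, which is why the 8B model still sits comfortably inside 24GB.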
Key takeaway this week: The infrastructure layer is maturing faster than the model layer. vLLM's Model Runner V2 and SGLang's Spec V2 default mean your existing models just got faster, so update your engines today. The MoE models (Qwen3.6-A3B, Gemma-4-A4B) are the practical winners for local inference, offering 30B-class capabilities at the per-token compute cost of a 3-4B model, though the full weights still have to live in VRAM or system RAM.
Scan completed: 2026-06-13 | Sources: HuggingFace API, llama.cpp GitHub, Ollama GitHub, vLLM GitHub, SGLang GitHub