Model Intelligence — 2026-06-13

🔥 Top Stories

1. vLLM v0.23.0: DeepSeek-V4 Hardening, Model Runner V2 Goes Wider

vLLM v0.23.0 shipped on June 12 with 408 commits from 200 contributors (63 new). This is a substantial release that matters for anyone running local or edge inference.

DeepSeek-V4 gets serious production treatment. Following its introduction in v0.22.0, the V4 architecture received a comprehensive optimization pass: sparse MLA metadata decoupled from V3.2, a new TRTLLM-gen attention kernel, EPLB support for the Mega-MoE, selective prefix-cache retention for sliding-window KV cache, and an index-share feature for DSA MTP. The model was also detached from torch.compile, and its attention/RoPE paths were refactored for better performance. An XPU attention decode path was added as well, giving Intel GPUs a decode path for V4.

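For local inference users: DeepSeek-V4-Pro now has 3.25 million downloads on HuggingFace and has entered the likes ranking at #14 (4,813 likes). Whether a quantized V4 variant is viable on a 24GB GPU depends on the total weight size of that variant, so check it against your VRAM before relying on vLLM's optimized paths. The new TRTLLM-gen attention kernel is an attention backend inside vLLM, so its speedup applies to vLLM deployments on supported NVIDIA GPUs rather than to TensorRT-LLM itself.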

Model Runner V2 expands to dense models. MRv2 is now the default for Llama and Mistral dense models in addition to Qwen3. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, and Gemma 4 MTP support. The result should be higher throughput and lower latency across the most commonly deployed model families. If you're running Llama 3 8B on an RTX 3060 or 3090, upgrading vLLM is all it takes to pick up the new runner; no configuration changes are needed.
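To confirm that an environment actually picked up the new release, a small version check can help. This is a minimal sketch: `vllm` is the PyPI package name, while `parse_release` and `installed_at_least` are hypothetical helper names, and the comparison deliberately ignores pre-release and local-build suffixes.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn '0.23.0' or '0.5.12.post1' into a tuple of leading integers.

    Suffixes such as 'rc1', 'post1' or '+cu128' are ignored, so this is only
    a coarse comparison, not a full PEP 440 implementation.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # e.g. 'post1': no leading digits, stop here
        parts.append(int(match.group()))
        if match.end() != len(piece):
            break  # e.g. '0rc1': keep the 0, drop the suffix
    return tuple(parts)


def installed_at_least(package: str, minimum: str) -> Optional[bool]:
    """True/False for the comparison, or None if the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_release(installed) >= parse_release(minimum)


print("vllm >= 0.23.0:", installed_at_least("vllm", "0.23.0"))
```

A result of `None` means the package is not installed in the active environment, which is worth ruling out before assuming an upgrade failed.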

Rust frontend grows up. The experimental Rust frontend added streaming generate, dynamic LoRA endpoints, /version, and /server_info endpoints. This is still experimental but moving toward production readiness, which could mean lower-latency API serving for self-hosted deployments.

2. SGLang v0.5.13: Nemotron 3 Ultra + 7 Diffusion Models in One Release

SGLang v0.5.13 dropped June 13 with an enormous model support expansion.

Autoregressive additions: NVIDIA's Nemotron 3 Ultra got day-0 support, along with Step-3.7-Flash and Command A+. Nemotron 3 Ultra is NVIDIA's flagship reasoning model — having immediate SGLang support means you can run it through a highly optimized inference engine rather than waiting for framework adoption.

Diffusion explosion: SGLang now supports 7 diffusion models: Cosmos3, LingBot-World, SANA-WM, Ernie-Image, FLUX.2-Klein 4B and 9B, and Ideogram 4. This is significant because it puts diffusion models behind the same serving stack used for autoregressive models, although optimizations built for token-by-token decoding (radix attention, speculative decoding) do not automatically carry over to diffusion. FLUX.2-Klein at 4B parameters needs about 8GB for 16-bit weights before the text encoder and activations, so a 12GB card is plausible and quantization adds headroom. That's RTX 3060 territory for state-of-the-art image generation through a unified inference engine.

Spec V2 goes default. Tree drafting with topk > 1 is now the production default across Triton, FA3, MLA, and aiter backends, including page_size > 1 and Mamba/hybrid-linear models. This means faster generation across all supported models without configuration changes.
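To illustrate why drafting a tree with topk > 1 can beat a single draft chain, here is a toy sketch. It is not SGLang's implementation: `target_next` and `draft_topk` are made-up stand-ins for the large and draft models, and verification is simulated step by step instead of in one batched forward pass with a tree attention mask.

```python
from typing import Dict, List, Sequence

Tree = Dict[int, dict]  # token -> subtree of drafted continuations


def target_next(context: Sequence[int]) -> int:
    """Toy 'large model': greedy next token is a fixed function of the last token."""
    return (2 * context[-1] + 1) % 10


def draft_topk(context: Sequence[int], k: int) -> List[int]:
    """Toy 'draft model': top-1 is often wrong, but top-2 contains the right token."""
    last = context[-1]
    ranked = [(last + 1) % 10, (2 * last + 1) % 10, (last + 3) % 10]
    return list(dict.fromkeys(ranked))[:k]  # de-duplicate, keep rank order


def build_draft_tree(context: Sequence[int], depth: int, topk: int) -> Tree:
    """Expand topk candidates per node, down to the given depth."""
    if depth == 0:
        return {}
    return {
        tok: build_draft_tree(list(context) + [tok], depth - 1, topk)
        for tok in draft_topk(context, topk)
    }


def verify_tree(context: Sequence[int], tree: Tree) -> List[int]:
    """Walk the tree along the target's greedy choices.

    Every emitted token is the target's own choice, so the output matches
    plain greedy decoding; drafting only changes how many tokens one
    verification step yields.
    """
    ctx, node, emitted = list(context), tree, []
    while True:
        want = target_next(ctx)
        emitted.append(want)
        ctx.append(want)
        if want not in node:  # draft missed (or we ran past a leaf): stop
            return emitted
        node = node[want]


def speculate(context: Sequence[int], depth: int, topk: int) -> List[int]:
    return verify_tree(context, build_draft_tree(context, depth, topk))


print(speculate([3], depth=3, topk=1))  # [7]
print(speculate([3], depth=3, topk=2))  # [7, 5, 1, 3]
```

With a single chain the first drafted token is wrong and only one token is produced. With two candidates per node the correct path exists in the tree, so three drafted tokens are accepted plus one bonus token from the target.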

3. Ollama v0.30.8: Better Prompt Caching and MLX Stability

Ollama v0.30.8 shipped June 12 with focused improvements that hit the sweet spot for desktop/local inference users.

Prompt caching decoupled from context shift for better KV cache reuse. This is a practical speedup — repeated prompts (system messages, repeated instructions) get served from cache instead of re-computed. If you're running chat assistants with long system prompts, this directly cuts latency.
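The saving from prefix reuse can be sketched as a longest-common-prefix check between the cached token sequence and the new prompt. This is a simplified illustration, not Ollama's implementation: the `PrefixCache` class and the token IDs are made up, and a real engine stores KV tensors per token, not just the IDs.

```python
from typing import List, Sequence, Tuple


def shared_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Tracks the tokens whose KV entries are (conceptually) still in the cache."""

    def __init__(self) -> None:
        self.tokens: List[int] = []

    def process(self, prompt: Sequence[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that need a fresh prefill)."""
        reused = shared_prefix_len(self.tokens, prompt)
        computed = len(prompt) - reused
        self.tokens = list(prompt)  # the cache now reflects the latest prompt
        return reused, computed


system_prompt = [1, 2, 3, 4, 5]  # stand-in for a long system message
cache = PrefixCache()
print(cache.process(system_prompt + [10, 11]))      # (0, 7): cold cache
print(cache.process(system_prompt + [20, 21, 22]))  # (5, 3): system prompt reused
```

The longer the shared system prompt is relative to each new user turn, the larger the share of prefill work that is skipped.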

MLX inference hardened with stable linear and embedding layers, plus snapshot creation during prompt processing and speculative decoding. Apple Silicon users get more reliable inference.

Recurrent model support improved with per-boundary states from gated-delta kernels. This gives hybrid and recurrent architectures proper state management in Ollama, and could make it easier to run other RNN-style models such as RWKV.

📊 Model Trends

HuggingFace Top 15 (by likes)

| Rank | Model | Likes | Trend |
|---|---|---|---|
| 1 | deepseek-ai/DeepSeek-R1 | 13,389 | +14 |
| 2 | black-forest-labs/FLUX.1-dev | 13,188 | +103 ⬆️ |
| 3 | stabilityai/stable-diffusion-xl-base-1.0 | 7,814 | |
| 4 | CompVis/stable-diffusion-v1-4 | 7,020 | |
| 5 | meta-llama/Meta-Llama-3-8B | 6,574 | |
| 6 | hexgrad/Kokoro-82M | 6,319 | |
| 7 | meta-llama/Llama-3.1-8B-Instruct | 6,072 | |
| 8 | openai/whisper-large-v3 | 5,812 | |
| 9 | black-forest-labs/FLUX.1-schnell | 5,110 | |
| 10 | bigscience/bloom | 5,011 | |
| 11 | stabilityai/stable-diffusion-3-medium | 4,972 | |
| 12 | sentence-transformers/all-MiniLM-L6-v2 | 4,941 | |
| 13 | openai/gpt-oss-120b | 4,880 | +23 |
| 14 | deepseek-ai/DeepSeek-V4-Pro | 4,813 | NEW ⬆️ |
| 15 | Tongyi-MAI/Z-Image-Turbo | 4,800 | NEW ⬆️ |

Notable movements: FLUX.1-dev gained 103 likes in 6 days — fastest climb in the top 20. DeepSeek-V4-Pro and Z-Image-Turbo are new top-15 entrants. gpt-oss-120b continues steady growth at 2.81M downloads.

Qwen Family

| Model | Likes | Trend | Notes |
|---|---|---|---|
| Qwen/QwQ-32B | 2,930 | | Top Qwen model |
| Jackrong/Qwen3.5-27B-Claude-4.6 | 2,878 | +5 | Community distillation |
| Qwen/Qwen-Image | 2,511 | NEW | Text-to-image model |
| Qwen/Qwen-Image-Edit | 2,424 | NEW | Image editing |
| Phr00t/Qwen-Image-Edit-Rapid-AIO | 2,146 | | Community port |
| Qwen/Qwen3.6-35B-A3B | 2,098 | +66 ⬆️ | MoE, 3.5M downloads |
| Qwen/Qwen2.5-Coder-32B-Instruct | 2,041 | | |
| Qwen/Qwen2.5-Omni-7B | 1,905 | | |

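Qwen3.6-35B-A3B is the story to watch. The MoE architecture activates only about 3B parameters per token, so this 35B model generates far faster than a dense model of the same size. Memory is a different matter: all 35B weights must be stored, which is roughly 21GB at Q4_K_M. On an RTX 3060 (12GB) that means keeping the attention and shared layers on the GPU and offloading most expert weights to system RAM, which stays usable because so few parameters are active per token. On an RTX 3090 (24GB), Q4_K_M fits entirely in VRAM with limited room for context, while Q6 (about 29GB) and Q8 (about 37GB) need partial offload. The 3.5M download count shows strong adoption.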
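A back-of-the-envelope size check makes the VRAM question concrete. The sketch below assumes approximate GGUF bits-per-weight figures and takes the nominal 35B total and 3B active counts from the model name; it ignores KV cache, activations, and runtime overhead, so treat the "fits" column as optimistic. The key point is that an MoE model must store all of its experts, so the total count determines storage, while the active count determines per-token compute.

```python
from typing import List, Tuple

# Approximate GGUF bits-per-weight figures (assumed); real files vary by tensor mix.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in decimal GB: parameters * bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def report(total_b: float, active_b: float, vram_gb: float) -> List[Tuple[str, float, float, bool]]:
    """Rows of (quant, total weights GB, weights touched per token GB, fits in VRAM)."""
    rows = []
    for quant, bpw in BITS_PER_WEIGHT.items():
        total = weight_gb(total_b, bpw)
        active = weight_gb(active_b, bpw)
        # 'fits' compares weights only; context and overhead need extra room.
        rows.append((quant, round(total, 1), round(active, 1), total <= vram_gb))
    return rows


for vram in (12, 24):
    print(f"--- {vram}GB card, 35B total / 3B active ---")
    for quant, total, active, fits in report(35, 3, vram):
        print(f"{quant:7s} total={total:5.1f}GB active={active:4.1f}GB fits={fits}")
```

When the total exceeds VRAM, the usual workaround is to keep the always-used layers on the GPU and serve expert weights from system RAM, trading some speed for capacity.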

Gemma Family

| Model | Likes | Trend | Notes |
|---|---|---|---|
| google/gemma-7b | 3,354 | | Classic |
| google/gemma-4-31B-it | 2,975 | +49 | Latest flagship |
| google/gemma-3-27b-it | 1,976 | | |
| google/gemma-3n-E4B-it-litert-preview | 1,484 | | Edge-ready |
| google/gemma-2-2b-it | 1,387 | | Ultra-light |
| google/gemma-3-4b-it | 1,365 | | |
| google/gemma-7b-it | 1,247 | | |
| google/gemma-4-E4B-it | 1,241 | | |
| google/gemma-2b | 1,193 | | |
| google/gemma-4-26B-A4B-it | 1,132 | NEW | 8.3M downloads |

Gemma 4 26B-A4B-it is the sleeper hit here. With 8.3 million downloads (the highest in the Gemma family), this MoE variant with 4B active parameters is clearly resonating. All 26B weights still have to be stored, which is roughly 15-16GB at Q4, so an RTX 3060 (12GB) needs partial offload to system RAM. On an RTX 3090 (24GB), Q4 fits with room for context and Q6 (about 21GB) is tight, while Q8 (about 28GB) and FP16 do not fit in VRAM.

⚙️ Engine Updates

llama.cpp — b9628 (June 14)

Ollama — v0.30.8 (June 12)

vLLM — v0.23.0 (June 12)

SGLang — v0.5.13 (June 13)

📰 AI News (HN)

Amazon CEO's Talks with U.S. Officials Triggered Crackdown on Anthropic Models

🔄 What Changed Since the Last Scan

Changes since the previous scan on 2026-06-07:

| Area | Previous | Current | Change |
|---|---|---|---|
| llama.cpp | b9551 | b9628 | +77 builds |
| Ollama | v0.30.6 | v0.30.8 | +2 patch releases |
| vLLM | v0.22.1 | v0.23.0 | Minor version bump |
| SGLang | v0.5.12.post1 | v0.5.13 | +1 patch release |
| DeepSeek-V4-Pro | Not tracked | 4,813 likes | New top-15 model |
| Qwen3.6-35B-A3B | 2,032 likes | 2,098 likes | +66 |
| FLUX.1-dev | 13,085 likes | 13,188 likes | +103 |
| gemma-4-26B-A4B-it | Not tracked | 1,132 likes | New, 8.3M downloads |
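The numeric deltas in this table can be reproduced from the two snapshots. The sketch below uses only the values listed above; the dictionary layout and the `deltas` helper are illustrative and not the scanner's actual data format.

```python
from datetime import date
from typing import Dict

# Values copied from the table; keys are illustrative labels.
previous = {"llama.cpp build": 9551, "Qwen3.6-35B-A3B likes": 2032, "FLUX.1-dev likes": 13085}
current = {"llama.cpp build": 9628, "Qwen3.6-35B-A3B likes": 2098, "FLUX.1-dev likes": 13188}

days_between_scans = (date(2026, 6, 13) - date(2026, 6, 7)).days


def deltas(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, int]:
    """Difference for every key present in both snapshots."""
    return {key: curr[key] - prev[key] for key in curr if key in prev}


for key, change in deltas(previous, current).items():
    per_day = change / days_between_scans
    print(f"{key}: {change:+d} over {days_between_scans} days ({per_day:.1f}/day)")
```

Dividing by the scan interval puts entries on a per-day footing, which matters when scans are not evenly spaced.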

Local Inference Recommendations (Updated)

RTX 3060 (12GB):

RTX 3090 (24GB):

Key takeaway this week: The infrastructure layer is maturing faster than the model layer. vLLM's Model Runner V2 and SGLang's Spec V2 default mean your existing models just got faster, so update your engines today. The MoE models (Qwen3.6-A3B, Gemma-4-A4B) are the practical winners for local inference, offering 30B-class capabilities at the per-token compute cost of 3-4B active parameters, although the full weights still have to fit across VRAM and system RAM.


Scan completed: 2026-06-13 | Sources: HuggingFace API, llama.cpp GitHub, Ollama GitHub, vLLM GitHub, SGLang GitHub
