Model Intelligence — 2026-06-20

🔥 Top Stories

1. llama.cpp: Vulkan F16 Toggle + 12x Faster Logprobs (b9727 → b9733)

Overnight, llama.cpp advanced six builds from b9727 to b9733, bringing two features that matter directly for local inference on consumer hardware, plus a server refactor:

ggml-webgpu now supports F16 adapter toggles for Vulkan + NVIDIA (b9733). This is significant if you're running WebGPU inference on Linux with NVIDIA GPUs. Vulkan's F16 support has been a long-standing gap for web-based inference — the adapter toggles let you explicitly enable half-precision compute paths where the driver supports them, unlocking better throughput for FP16-native models. If you're experimenting with llama.cpp's WebGPU backend on an RTX card under Linux, this is your cue to test b9733.

Token probability sorting is now 12x faster (b9731). The get_token_probabilities routine switched from std::sort (which orders the full vocabulary) to std::partial_sort (which orders only the requested top-N). The benchmarks are dramatic: on a 128K vocabulary, full sort takes 8,556 µs per operation while partial sort drops to 704 µs, a ratio of about 12.2x. For anyone building UIs that display token probabilities (sampling visualizers, chain-of-thought inspectors, temperature explorers), this makes the /completion endpoint with n_probs or logprobs usable at interactive speeds.
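The difference between the two strategies can be sketched in Python. This is an illustration of the technique only, not llama.cpp's code: heapq.nlargest stands in for std::partial_sort, and the function names are hypothetical.

```python
import heapq

def top_n_full_sort(probs, n):
    """Order every token id by probability, then keep the first n."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return order[:n]

def top_n_partial(probs, n):
    """Select only the n most probable token ids; the rest stay unordered."""
    return heapq.nlargest(n, range(len(probs)), key=probs.__getitem__)

# Timings quoted in this briefing, in microseconds per operation.
reported_speedup = 8556 / 704
```

Both functions return the same token ids in the same order; the partial version simply avoids ordering the vocabulary entries that are never shown.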

Server router communication was refactored (b9732). The child-to-router messaging layer was rebuilt with improved update_status() semantics and better wakeup handling. If you've noticed occasional hangs or status-reporting glitches in the llama.cpp server's multi-slot mode, this should help.

For local inference on an RTX 3060 (12GB), the logprobs improvement means you can now run speculative decoding debug sessions without a noticeable slowdown. On an RTX 3090 (24GB), the Vulkan F16 toggle opens a path to WebGPU-based half-precision inference that was previously blocked.
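To try the logprobs path against a local llama.cpp server, a request along these lines should work. It is a minimal sketch: the base URL and port, the n_predict value, and the helper names are assumptions, and the response is returned unparsed because its layout can differ between builds.

```python
import json
import urllib.request

def build_completion_request(prompt, n_probs, base_url="http://127.0.0.1:8080"):
    """Build a POST to /completion asking for the top n_probs tokens per step."""
    # base_url is an assumed local address; adjust to wherever the server listens.
    payload = {"prompt": prompt, "n_predict": 16, "n_probs": n_probs}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_completion(prompt, n_probs):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_completion_request(prompt, n_probs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling fetch_completion("The capital of France is", 5) with a server running returns the completion along with per-token probability data.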

Builds: b9733 (latest) · b9732 · b9731


2. vLLM v0.23.0: The DeepSeek-V4 Hardening Release

Released June 15, vLLM v0.23.0 is a massive release with 408 commits from 200 contributors (63 new). This is the release that makes vLLM's DeepSeek-V4 support production-ready.

DeepSeek-V4 got a major hardening pass: The sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#44699), it gained a TRTLLM-generated attention kernel (#43827), EPLB support for the Mega-MoE (#43339), selective prefix-cache retention for sliding-window KV cache (#43447), and an index-share feature for DSA MTP (#44420). The model was also detached from torch.compile (#43746, #43891), its attention and RoPE paths were refactored (#44569, #44262, #43926), and an XPU attention decode path was added (#42953). For anyone running DeepSeek-V4 in production, this release fixes the rough edges from v0.22.0.

Model Runner V2 is now default for Llama and Mistral dense models. MRv2 previously launched for Qwen3; it now expands to the most widely deployed model families. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, kernel block-size support for hybrid models, and Gemma 4 MTP. If you're serving Llama 3.1 8B or Mistral models on GPU, MRv2 should give you better throughput.

The experimental Rust frontend is growing up. It added a streaming generate endpoint, dynamic LoRA endpoints, /version and /server_info endpoints, request-ID headers, and many new tool parsers (InternLM2, hy_v3, Phi-4-mini, Gemma4). The Rust frontend is still experimental but moving fast; a low-latency alternative to the Python frontend may be viable within a few months.

Gemma 4 Unified (encoder-free) is now supported (#44429) with Gemma 4 MTP (#43241). If you've been waiting to serve Google's latest Gemma 4 on vLLM, this is the release.

Multi-tier KV cache offloading gained an object-store secondary tier (#41968) with HMA enabled by default for capable connectors. This extends KV cache offloading beyond CPU memory into disk and network object stores — useful for serving long-context models on memory-constrained setups.
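The idea of a secondary tier can be illustrated with a toy cache. This is a generic sketch, not vLLM's implementation: the class and its names are hypothetical, a dict stands in for the object store, and eviction is plain least-recently-used.

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier cache: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()  # stands in for GPU or CPU memory
        self.slow = {}             # stands in for disk or an object store
        self.fast_capacity = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            # Demote the least recently used entry instead of dropping it.
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)  # promote back to the fast tier on a hit
            return value
        return None
```

The point of the second tier is that an evicted block costs a slower fetch rather than a full recomputation.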

For local inference context: vLLM is primarily a serving engine, not a local-inference tool, but understanding its model support roadmap tells you which models will get optimized quantization paths, fast kernels, and bugfixes. The DeepSeek-V4 hardening in particular means community GGUF converters will have a more stable reference implementation to work from.

Changelog: vLLM v0.23.0


3. Ollama v0.30.10: Apple Silicon MLX Expands, Cohere2Moe Lands

Ollama's latest release brings three practical improvements:

Command A and North family models now run on Apple Silicon with the MLX engine. If you're on a Mac and want to test Cohere's open-weight Command A or the North family models, they now work through Ollama's MLX backend. The MLX runner has been steadily improved (snapshot creation during prompt processing, hardened linear/embedding layers, speculative decoding support), and this release broadens the model coverage.

Cohere2Moe architecture support was added in v0.30.9 and carried forward. Cohere's Mixture-of-Experts models can now run through Ollama's pipeline.

Prompt caching was decoupled from context shift (v0.30.8). This improves KV cache reuse when the context window shifts, which is a common scenario in conversational use. You should see faster response times in longer conversations where earlier context is retained but new messages are appended.
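Prefix-based cache reuse comes down to finding how many leading tokens the new prompt shares with what is already cached. The following is a minimal sketch of that check, not Ollama's implementation; the function name is hypothetical.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count the leading tokens shared by the cached and the new prompt."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared

# A follow-up turn that appends to the earlier conversation reuses all of it.
earlier_turns = [101, 7, 42, 9]
with_new_message = earlier_turns + [55, 56]
tokens_to_recompute = len(with_new_message) - reusable_prefix_len(
    earlier_turns, with_new_message
)
```

Only the tokens past the shared prefix need to be processed again, which is why appended messages in a long conversation respond faster when the cache survives.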

For RTX 3060/3090 users: Ollama on Linux uses the llama.cpp backend, now at build 9672, which is 61 builds behind b9733. The underlying performance improvements in llama.cpp (like the logprobs optimization) will flow into Ollama with each engine update, but there's typically a 2-3 day lag.

Changelog: v0.30.10 · v0.30.9


📊 Model Trends

HuggingFace Trending (Top 15 by Likes)

| Rank | Model | Likes | Category |
|---|---|---|---|
| 1 | deepseek-ai/DeepSeek-R1 | 13,401 | Reasoning / LLM |
| 2 | black-forest-labs/FLUX.1-dev | 13,266 | Image Generation |
| 3 | stabilityai/SDXL | 7,829 | Image Generation |
| 4 | CompVis/SD v1.4 | 7,022 | Image Generation |
| 5 | meta-llama/Llama-3-8B | 6,579 | LLM |
| 6 | hexgrad/Kokoro-82M | 6,365 | TTS |
| 7 | meta-llama/Llama-3.1-8B-Instruct | 6,117 | LLM |
| 8 | openai/whisper-large-v3 | 5,838 | Speech |
| 9 | black-forest-labs/FLUX.1-schnell | 5,154 | Image Generation |
| 10 | bigscience/bloom | 5,012 | LLM |
| 11 | sentence-transformers/all-MiniLM-L6-v2 | 4,976 | Embeddings |
| 12 | stabilityai/SD3-medium | 4,976 | Image Generation |
| 13 | deepseek-ai/DeepSeek-V4-Pro | 4,969 | LLM / MoE |
| 14 | openai/gpt-oss-120b | 4,899 | LLM |
| 15 | Tongyi-MAI/Z-Image-Turbo | 4,837 | Image Generation |

Notable shifts: DeepSeek-V4-Pro climbed +8 likes to 4,969, leaving it 7 likes short of SD3-medium at 4,976. FLUX.1-dev gained +8 likes. The top-2 gap (DeepSeek-R1 vs FLUX.1-dev) narrowed from 142 to 135 likes.
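The gap figures follow directly from the like counts. A small sketch of the arithmetic, where yesterday's counts are derived from the +1 and +8 deltas reported in this briefing and the variable names are illustrative:

```python
likes_today = {
    "deepseek-ai/DeepSeek-R1": 13_401,
    "black-forest-labs/FLUX.1-dev": 13_266,
}
likes_yesterday = {
    "deepseek-ai/DeepSeek-R1": 13_400,
    "black-forest-labs/FLUX.1-dev": 13_258,
}

def top2_gap(likes):
    """Difference in likes between the two listed models."""
    return likes["deepseek-ai/DeepSeek-R1"] - likes["black-forest-labs/FLUX.1-dev"]

gap_today = top2_gap(likes_today)
gap_yesterday = top2_gap(likes_yesterday)
narrowed_by = gap_yesterday - gap_today
```
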

Qwen Model Rankings

| Model | Likes | Δ | Notes |
|---|---|---|---|
| Qwen/QwQ-32B | 2,931 | | Reasoning model, fits RTX 3090 at Q4_K_M (~18GB) |
| Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 2,885 | +1 | Distilled reasoning, ~17GB at Q4 |
| Qwen/Qwen-Image | 2,512 | | Vision model |
| Qwen/Qwen-Image-Edit | 2,426 | | Image editing |
| Qwen/Qwen3.6-35B-A3B | 2,174 | +4 | MoE with ~3B active parameters, ~20GB at Q4: RTX 3090, or RTX 3060 with partial CPU offload |
| Qwen/Qwen2.5-Coder-32B-Instruct | 2,046 | | Code model, ~18GB at Q4: RTX 3090 |
| Qwen/Qwen3.6-27B | 1,756 | +5 | Dense model, ~16GB at Q4: RTX 3090 |

Key takeaway for local inference: Qwen3.6-35B-A3B is the standout. It is a Mixture-of-Experts model in which only about 3B of its 35B parameters fire per token. That sparsity cuts compute per token, not storage: every expert still has to be loaded, so the Q4_K_M weights come to roughly 20GB. The model fits an RTX 3090 outright. On a 12GB RTX 3060 it is runnable only with part of the expert weights held in system RAM, which the low active-parameter count makes far more tolerable than it would be for a dense 35B model.

Gemma Model Rankings

| Model | Likes | Δ | VRAM Estimate (Q4) |
|---|---|---|---|
| google/gemma-7b | 3,359 | | ~4.5GB, RTX 3060 |
| google/gemma-4-31B-it | 3,032 | +4 | ~18GB, RTX 3090 |
| google/gemma-3-27b-it | 1,981 | | ~16GB, RTX 3090 |
| google/gemma-3n-E4B-it-litert-preview | 1,485 | | ~3GB, any GPU |
| google/gemma-2-2b-it | 1,396 | | ~1.4GB, any GPU |
| google/gemma-3-4b-it | 1,372 | +1 | ~2.5GB, any GPU |
| google/gemma-4-E4B-it | 1,264 | +2 | ~2.5GB, any GPU |
| google/gemma-7b-it | 1,247 | | ~4.5GB, RTX 3060 |
| google/gemma-2b | 1,195 | | ~1.4GB, any GPU |
| google/gemma-4-26B-A4B-it | 1,162 | | ~15GB, RTX 3090 |

Key takeaway: Gemma 4 31B-instruct is the most popular Gemma 4 model and fits on an RTX 3090 at Q4. The E4B variants (Gemma 4 E4B-it, Gemma 3n E4B) are tiny enough to run on any modern GPU with room for very long context windows.
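VRAM figures like these can be approximated from parameter count alone. The sketch below is a rough rule of thumb, not a measurement: the 4.8 bits-per-weight average for Q4-class quantization is an assumption, real files vary by architecture and quantization mix, and the KV cache and runtime buffers come on top.

```python
def estimate_weight_gb(params_billion, bits_per_weight=4.8):
    """Approximate size of quantized weights in GB (weights only)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8

gemma_4_31b = estimate_weight_gb(31)
gemma_3_27b = estimate_weight_gb(27)
gemma_4_26b_a4b = estimate_weight_gb(26)  # MoE: all 26B parameters are stored
```

Under that assumption the three larger models land at about 18.6GB, 16.2GB, and 15.6GB, in line with the table's estimates.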


⚙️ Engine Updates

| Engine | Latest Version | Released | Status |
|---|---|---|---|
| llama.cpp | b9733 | 2026-06-20 | 🟢 Updated |
| Ollama | v0.30.10 | 2026-06-17 | No new release |
| vLLM | v0.23.0 | 2026-06-15 | No new release |
| SGLang | v0.5.13 | 2026-06-13 | No new release |

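llama.cpp notable changes since yesterday (b9727 → b9733): F16 adapter toggles for ggml-webgpu on Vulkan + NVIDIA (b9733), a refactored child-to-router messaging layer in the server (b9732), and partial-sort token probability ranking (b9731).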

Ollama notable (unchanged since scan): v0.30.10 added Command A/North MLX support on Apple Silicon and updated llama.cpp engine to build 9672.

vLLM notable (unchanged since scan): v0.23.0 is the DeepSeek-V4 hardening release with 408 commits, MRv2 for Llama/Mistral, and Gemma 4 Unified support.


📰 AI News (Hacker News)


🔄 What Changed Since Yesterday

| Area | Change | Impact |
|---|---|---|
| llama.cpp | b9727 → b9733 (6 new builds) | Vulkan F16 toggle for NVIDIA WebGPU, 12x faster logprobs |
| HF: DeepSeek-R1 | 13,400 → 13,401 (+1) | Stable at #1 |
| HF: FLUX.1-dev | 13,258 → 13,266 (+8) | Still #2, narrowing gap |
| HF: DeepSeek-V4-Pro | 4,961 → 4,969 (+8) | Climbing, approaching SD3-medium |
| HF: Llama-3.1-8B-Instruct | 6,112 → 6,117 (+5) | Steady growth |
| Qwen: Qwen3.6-35B-A3B | 2,170 → 2,174 (+4) | Gaining traction |
| Qwen: Qwen3.6-27B | 1,751 → 1,756 (+5) | Steady |
| Gemma: gemma-4-31B-it | 3,028 → 3,032 (+4) | Leading Gemma 4 adoption |
| Gemma: gemma-3-4b-it | 1,371 → 1,372 (+1) | Minor |
| Ollama | No new release | v0.30.10 still latest |
| vLLM | No new release | v0.23.0 still latest |
| SGLang | No new release | v0.5.13 still latest |

Bottom line: The main action is in llama.cpp with the Vulkan F16 and logprobs improvements. Model popularity shifts are incremental — no surprise new releases. vLLM v0.23.0 remains the biggest story of the week and is worth upgrading to if you're serving DeepSeek-V4 or Llama/Mistral models.


Generated by Hermes Agent on 2026-06-20
