Model Intelligence — 2026-06-20
🔥 Top Stories
1. llama.cpp: Vulkan F16 Toggle + 12x Faster Logprobs (b9727 → b9733)
Overnight, llama.cpp advanced six builds from b9727 to b9733, bringing two features that matter directly for local inference on consumer hardware:
ggml-webgpu now supports F16 adapter toggles for Vulkan + NVIDIA (b9733). This is significant if you're running WebGPU inference on Linux with NVIDIA GPUs. Vulkan's F16 support has been a long-standing gap for web-based inference — the adapter toggles let you explicitly enable half-precision compute paths where the driver supports them, unlocking better throughput for FP16-native models. If you're experimenting with llama.cpp's WebGPU backend on an RTX card under Linux, this is your cue to test b9733.
Token probability sorting is now 12x faster (b9731). The server's get_token_probabilities routine switched from std::sort (full vocabulary sort) to std::partial_sort (only the requested top-N). On a 128K vocabulary, full sort takes 8,556 µs per operation while partial sort takes 704 µs, roughly a 12x speedup. For anyone building UIs that display token probabilities (sampling visualizers, chain-of-thought inspectors, temperature explorers), this makes the /completion endpoint with n_probs or logprobs usable at interactive speeds.
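The idea behind the change can be sketched in Python. This is only an analogue under stated assumptions, not the llama.cpp C++ code: heapq.nlargest stands in for std::partial_sort, the probabilities are random, and the function names are made up for illustration.

```python
import heapq
import random


def top_n_full_sort(probs, n):
    """Sort the entire vocabulary, then keep the first n (the old approach)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:n]]


def top_n_partial(probs, n):
    """Select only the n largest entries (stand-in for std::partial_sort)."""
    best = heapq.nlargest(n, range(len(probs)), key=lambda i: probs[i])
    return [(i, probs[i]) for i in best]


random.seed(0)
vocab_size = 128_000  # same order of magnitude as the 128K vocabulary above
weights = [random.random() for _ in range(vocab_size)]
total = sum(weights)
probs = [w / total for w in weights]  # synthetic distribution, not model output

full = top_n_full_sort(probs, 10)
partial = top_n_partial(probs, 10)
print(full == partial)
```

Both functions return the same top-N list; the second avoids ordering the entries that will never be shown, which is where the saving comes from.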
Server router communication was refactored (b9732). The child-to-router messaging layer was rebuilt with improved update_status() semantics and better wakeup handling. If you've noticed occasional hangs or status-reporting glitches in the llama.cpp server's multi-slot mode, this should help.
For local inference on an RTX 3060 (12GB), the logprobs improvement means you can now run speculative decoding debug sessions without a noticeable slowdown. On an RTX 3090 (24GB), the Vulkan F16 toggle opens a path to WebGPU-based half-precision inference that was previously blocked.
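To try the logprobs path, a request to the server's /completion endpoint with n_probs set might look like the sketch below. The base URL, prompt, and helper names are assumptions for illustration. The response is returned unparsed because the layout of the probability fields can differ between builds.

```python
import json
import urllib.request


def build_completion_request(prompt, n_probs=5, n_predict=16,
                             base_url="http://localhost:8080"):
    """Build a POST request asking for the top n_probs candidates per token.

    base_url is an assumption; point it at wherever your server listens.
    """
    payload = {"prompt": prompt, "n_predict": n_predict, "n_probs": n_probs}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_completion(request, timeout=30):
    """Send the request; this only works with a llama.cpp server running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_completion_request("The capital of France is", n_probs=5)
print(request.full_url, request.data.decode("utf-8"))
```

Calling fetch_completion(request) against a running server returns the decoded JSON body, which you can then inspect for the per-token candidates.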
Builds: b9733 (latest) · b9732 · b9731
2. vLLM v0.23.0: The DeepSeek-V4 Hardening Release
Released June 15, vLLM v0.23.0 is a massive release with 408 commits from 200 contributors (63 new). This is the release that makes vLLM's DeepSeek-V4 support production-ready.
DeepSeek-V4 got a major hardening pass: The sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#44699), it gained a TRTLLM-generated attention kernel (#43827), EPLB support for the Mega-MoE (#43339), selective prefix-cache retention for sliding-window KV cache (#43447), and an index-share feature for DSA MTP (#44420). The model was also detached from torch.compile (#43746, #43891), its attention and RoPE paths were refactored (#44569, #44262, #43926), and an XPU attention decode path was added (#42953). For anyone running DeepSeek-V4 in production, this release fixes the rough edges from v0.22.0.
Model Runner V2 is now default for Llama and Mistral dense models. MRv2 previously launched for Qwen3; it now expands to the most widely deployed model families. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, kernel block-size support for hybrid models, and Gemma 4 MTP. If you're serving Llama 3.1 8B or Mistral models on GPU, MRv2 should give you better throughput.
The experimental Rust frontend is growing up. It added a streaming generate endpoint, dynamic LoRA endpoints, /version and /server_info endpoints, request-ID headers, and many new tool parsers (InternLM2, hy_v3, Phi-4-mini, Gemma4). The Rust frontend is still experimental, but it is moving fast, and a low-latency alternative to the Python frontend may be viable within a few months.
Gemma 4 Unified (encoder-free) is now supported (#44429) with Gemma 4 MTP (#43241). If you've been waiting to serve Google's latest Gemma 4 on vLLM, this is the release.
Multi-tier KV cache offloading gained an object-store secondary tier (#41968) with HMA enabled by default for capable connectors. This extends KV cache offloading beyond CPU memory into disk and network object stores — useful for serving long-context models on memory-constrained setups.
For local inference context: vLLM is primarily a serving engine, not a local-inference tool, but understanding its model support roadmap tells you which models will get optimized quantization paths, fast kernels, and bugfixes. The DeepSeek-V4 hardening in particular means community GGUF converters will have a more stable reference implementation to work from.
Changelog: vLLM v0.23.0
3. Ollama v0.30.10: Apple Silicon MLX Expands, Cohere2Moe Lands
Ollama's latest release brings three practical improvements:
Command A and North family models now run on Apple Silicon with the MLX engine. If you're on a Mac and want to test Cohere's open-weight Command A or the North family models, they now work through Ollama's MLX backend. The MLX runner has been steadily improved (snapshot creation during prompt processing, hardened linear/embedding layers, speculative decoding support), and this release broadens the model coverage.
Cohere2Moe architecture support was added in v0.30.9 and carried forward. Cohere's Mixture-of-Experts models can now run through Ollama's pipeline.
Prompt caching was decoupled from context shift (v0.30.8). This improves KV cache reuse when the context window shifts, which is a common scenario in conversational use. You should see faster response times in longer conversations where earlier context is retained but new messages are appended.
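Why appended messages benefit from cache reuse can be illustrated with a toy prefix match. This is a conceptual sketch, not Ollama's implementation: the token IDs are hypothetical, and the assumption is simply that cached state is reusable for the longest shared prefix of the old and new prompts.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count leading tokens shared by the cached prompt and the new prompt."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared


# Hypothetical token IDs: the new turn appends to the earlier conversation.
cached = [1, 15, 27, 93, 4, 88]
appended = cached + [301, 302, 303]
# Same conversation, but an early token was changed before resending.
edited = [1, 15, 99, 93, 4, 88, 301]

reused = reusable_prefix_len(cached, appended)
to_process = len(appended) - reused
print(reused, to_process, reusable_prefix_len(cached, edited))
```

In the appended case only the three new tokens need processing, while an edit near the start of the prompt invalidates almost everything after it.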
For RTX 3060/3090 users: Ollama on Linux uses the llama.cpp backend, now at build 9672, which is 61 builds behind b9733. The underlying performance improvements in llama.cpp (like the logprobs optimization) will flow into Ollama with each engine update, but there's typically a 2-3 day lag.
📊 Model Trends
HuggingFace Trending (Top 15 by Likes)
| Rank | Model | Likes | Category |
|---|---|---|---|
| 1 | deepseek-ai/DeepSeek-R1 | 13,401 | Reasoning / LLM |
| 2 | black-forest-labs/FLUX.1-dev | 13,266 | Image Generation |
| 3 | stabilityai/SDXL | 7,829 | Image Generation |
| 4 | CompVis/SD v1.4 | 7,022 | Image Generation |
| 5 | meta-llama/Llama-3-8B | 6,579 | LLM |
| 6 | hexgrad/Kokoro-82M | 6,365 | TTS |
| 7 | meta-llama/Llama-3.1-8B-Instruct | 6,117 | LLM |
| 8 | openai/whisper-large-v3 | 5,838 | Speech |
| 9 | black-forest-labs/FLUX.1-schnell | 5,154 | Image Generation |
| 10 | bigscience/bloom | 5,012 | LLM |
| 11 | sentence-transformers/all-MiniLM-L6-v2 | 4,976 | Embeddings |
| 12 | stabilityai/SD3-medium | 4,976 | Image Generation |
| 13 | deepseek-ai/DeepSeek-V4-Pro | 4,969 | LLM / MoE |
| 14 | openai/gpt-oss-120b | 4,899 | LLM |
| 15 | Tongyi-MAI/Z-Image-Turbo | 4,837 | Image Generation |
Notable shifts: DeepSeek-V4-Pro gained 8 likes to reach 4,969, 7 short of the 4,976 shared by SD3-medium and all-MiniLM-L6-v2. FLUX.1-dev also gained 8 likes. The top-2 gap (DeepSeek-R1 vs FLUX.1-dev) narrowed from 142 to 135 likes.
Qwen Model Rankings
| Model | Likes | Δ | Notes |
|---|---|---|---|
| Qwen/QwQ-32B | 2,931 | — | Reasoning model, fits RTX 3090 at Q4_K_M (~18GB) |
| Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 2,885 | +1 | Distilled reasoning, ~17GB at Q4 |
| Qwen/Qwen-Image | 2,512 | — | Image generation model |
| Qwen/Qwen-Image-Edit | 2,426 | — | Image editing |
| Qwen/Qwen3.6-35B-A3B | 2,174 | +4 | Sparse MoE (~3B active), ~20GB at Q4, RTX 3090 or RTX 3060 with CPU offload |
| Qwen/Qwen2.5-Coder-32B-Instruct | 2,046 | — | Code model, ~18GB at Q4 — RTX 3090 |
| Qwen/Qwen3.6-27B | 1,756 | +5 | Dense model, ~16GB at Q4 — RTX 3090 |
Key takeaway for local inference: Qwen3.6-35B-A3B is the standout. It is a sparse MoE architecture in which only about 3B of the 35B parameters are active per token (the A3B in the name). Sparsity cuts per-token compute, not storage: every expert still has to be loaded, so at Q4_K_M the weights come to roughly 20GB, in line with the ~18GB listed for the 32B models. That fits an RTX 3090 outright. On a 12GB RTX 3060 it needs part of the expert weights offloaded to system RAM, which the small active set keeps usable, though throughput will depend on the setup.
Gemma Model Rankings
| Model | Likes | Δ | VRAM Estimate (Q4) |
|---|---|---|---|
| google/gemma-7b | 3,359 | — | ~4.5GB — RTX 3060 |
| google/gemma-4-31B-it | 3,032 | +4 | ~18GB — RTX 3090 |
| google/gemma-3-27b-it | 1,981 | — | ~16GB — RTX 3090 |
| google/gemma-3n-E4B-it-litert-preview | 1,485 | — | ~3GB — any GPU |
| google/gemma-2-2b-it | 1,396 | — | ~1.4GB — any GPU |
| google/gemma-3-4b-it | 1,372 | +1 | ~2.5GB — any GPU |
| google/gemma-4-E4B-it | 1,264 | +2 | ~2.5GB — any GPU |
| google/gemma-7b-it | 1,247 | — | ~4.5GB — RTX 3060 |
| google/gemma-2b | 1,195 | — | ~1.4GB — any GPU |
| google/gemma-4-26B-A4B-it | 1,162 | — | ~15GB — RTX 3090 |
Key takeaway: Gemma 4 31B-instruct is the most popular Gemma 4 model and fits on an RTX 3090 at Q4. The E4B variants (Gemma 4 E4B-it, Gemma 3n E4B) are tiny enough to run on any modern GPU with room for very long context windows.
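The Q4 estimates in the table can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below rests on assumptions: about 4.5 bits per weight for a Q4-class quantization, parameter counts read off the model names, and a 2 GB allowance for KV cache and buffers. Real files vary by quantization mix. The 26B-A4B row uses the total parameter count, which is what reproduces the table's ~15GB.

```python
def estimate_weight_gb(total_params_billion, bits_per_weight=4.5):
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(total_params_billion, vram_gb, overhead_gb=2.0, bits_per_weight=4.5):
    """True if weights plus an assumed overhead fit within vram_gb."""
    needed = estimate_weight_gb(total_params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


# Parameter counts (in billions) are assumptions taken from the model names.
models = [("gemma-4-31B-it", 31), ("gemma-3-27b-it", 27),
          ("gemma-4-26B-A4B-it", 26)]

for name, params in models:
    print(f"{name}: ~{estimate_weight_gb(params):.1f} GB, "
          f"fits 24 GB: {fits(params, 24)}, fits 12 GB: {fits(params, 12)}")
```

Under these assumptions all three land within about 1 GB of the table's figures and fit a 24GB card but not a 12GB one.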
⚙️ Engine Updates
| Engine | Latest Version | Released | Status |
|---|---|---|---|
| llama.cpp | b9733 | 2026-06-20 | 🟢 Updated |
| Ollama | v0.30.10 | 2026-06-17 | No new release |
| vLLM | v0.23.0 | 2026-06-15 | No new release |
| SGLang | v0.5.13 | 2026-06-13 | No new release |
llama.cpp detailed changes since yesterday (b9727 → b9733):
- b9733: ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA — enables half-precision WebGPU on NVIDIA Linux
- b9732: Server router communication refactor — improved child→router messaging and status updates
- b9731: get_token_probabilities optimization: std::partial_sort replaces full sort, 12x faster on 128K vocab (8,556 → 704 µs)
- b9730–b9729: Additional builds on 2026-06-19
Ollama notable (unchanged since scan): v0.30.10 added Command A/North MLX support on Apple Silicon and updated llama.cpp engine to build 9672.
vLLM notable (unchanged since scan): v0.23.0 is the DeepSeek-V4 hardening release with 408 commits, MRv2 for Llama/Mistral, and Gemma 4 Unified support.
📰 AI News (Hacker News)
- [712 pts] Hyundai buys Boston Dynamics: SoftBank exits for $325M, Hyundai takes full control.
- [514 pts] Norway imposes near ban on AI in elementary school: new restrictions on AI use in primary education.
- [88 pts] John Jumper to join Anthropic: the Google DeepMind AlphaFold lead and 2024 Chemistry Nobel laureate is moving to Anthropic, a significant talent shift in AI research leadership.
🔄 What Changed Since Yesterday
| Area | Change | Impact |
|---|---|---|
| llama.cpp | b9727 → b9733 (6 new builds) | Vulkan F16 toggle for NVIDIA WebGPU, 12x faster logprobs |
| HF: DeepSeek-R1 | 13,400 → 13,401 (+1) | Stable at #1 |
| HF: FLUX.1-dev | 13,258 → 13,266 (+8) | Still #2, narrowing gap |
| HF: DeepSeek-V4-Pro | 4,961 → 4,969 (+8) | Climbing, approaching SD3-medium |
| HF: Llama-3.1-8B-Instruct | 6,112 → 6,117 (+5) | Steady growth |
| Qwen: Qwen3.6-35B-A3B | 2,170 → 2,174 (+4) | Gaining traction |
| Qwen: Qwen3.6-27B | 1,751 → 1,756 (+5) | Steady |
| Gemma: gemma-4-31B-it | 3,028 → 3,032 (+4) | Leading Gemma 4 adoption |
| Gemma: gemma-3-4b-it | 1,371 → 1,372 (+1) | Minor |
| Ollama | No new release | v0.30.10 still latest |
| vLLM | No new release | v0.23.0 still latest |
| SGLang | No new release | v0.5.13 still latest |
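
The deltas in the table are plain differences between two daily snapshots. A minimal sketch of that comparison follows. The counts are copied from the rows above, while the dict layout and function name are assumptions for illustration.

```python
def like_deltas(previous, current):
    """Return {model: (old, new, delta)} for models whose like count changed."""
    changes = {}
    for model, new_count in current.items():
        old_count = previous.get(model)
        if old_count is not None and new_count != old_count:
            changes[model] = (old_count, new_count, new_count - old_count)
    return changes


# Counts copied from the table; the snapshot structure is an assumption.
yesterday = {"DeepSeek-R1": 13400, "FLUX.1-dev": 13258, "DeepSeek-V4-Pro": 4961}
today = {"DeepSeek-R1": 13401, "FLUX.1-dev": 13266, "DeepSeek-V4-Pro": 4969}

changes = like_deltas(yesterday, today)
top2_gap = today["DeepSeek-R1"] - today["FLUX.1-dev"]
print(changes, top2_gap)
```

Models missing from the earlier snapshot are skipped, so a newly tracked entry does not appear as a spurious jump.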
Bottom line: The main action is in llama.cpp with the Vulkan F16 and logprobs improvements. Model popularity shifts are incremental — no surprise new releases. vLLM v0.23.0 remains the biggest story of the week and is worth upgrading to if you're serving DeepSeek-V4 or Llama/Mistral models.
Source Links
- llama.cpp b9733
- llama.cpp b9732
- llama.cpp b9731
- Ollama v0.30.10
- vLLM v0.23.0
- SGLang v0.5.13
- HuggingFace Trending Models
- Qwen Models on HuggingFace
- Gemma Models on HuggingFace
- Hacker News
Generated by Hermes Agent on 2026-06-20