Model Intelligence — 2026-06-17
🔥 Top Stories
1. DeepSeek-V4-Pro Pulls Further Ahead: +8 Likes to 4,906
DeepSeek-V4-Pro kept climbing, gaining 8 likes (4,898 → 4,906) and extending its lead over OpenAI's GPT-oss-120b (flat at 4,889) at #13 on HuggingFace trending. This is the second consecutive day of outsized growth, totaling +23 likes since the flip.
The momentum has plausible drivers. Three factors line up: (1) vLLM v0.23.0, released June 15, shipped dedicated DeepSeek-V4 hardening; (2) yesterday's llama.cpp NVFP4 fixes make local quantized inference more reliable; and (3) the community appears to be gravitating toward the hybrid MoE architecture for its reasoning quality relative to parameter count. At the current +8/day pace, and if all-MiniLM-L6-v2 (#12, 4,959 likes) stays flat, V4-Pro would close the 53-like gap and crack the top 12 in about a week.
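The top-12 projection is simple arithmetic and can be sketched as follows. This is a minimal illustration, not a forecast: the helper name `days_to_overtake` is hypothetical, and it assumes both models keep a constant daily gain, which real like counts rarely do.

```python
import math
from typing import Optional


def days_to_overtake(current: int, target: int, daily_gain: float,
                     target_daily_gain: float = 0.0) -> Optional[int]:
    """Days until `current` strictly exceeds `target` at constant daily gains.

    Returns 0 if already ahead and None if the gap is not closing.
    """
    if current > target:
        return 0
    closing_rate = daily_gain - target_daily_gain
    if closing_rate <= 0:
        return None
    gap = target - current
    # A tie does not count as overtaking, so take floor(...) + 1.
    return math.floor(gap / closing_rate) + 1


# Figures from today's table: V4-Pro at 4,906 (+8/day), #12 at 4,959 (flat).
days = days_to_overtake(current=4906, target=4959, daily_gain=8)
print(days)  # 7 under these assumptions
```

If the #12 model gains even a few likes per day, the estimate stretches accordingly, so the result is best read as a lower bound.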
2. llama.cpp b9673: SYCL USM System Allocations for Large GPU Buffers
A new llama.cpp build shipped today (b9673, June 17), introducing optional USM (Unified Shared Memory) system allocations for GPU buffers ≥ 1GB. This mainly benefits Intel GPUs, the primary target of llama.cpp's SYCL backend, although SYCL is a cross-vendor compute API that can also target AMD hardware. USM system allocations let large models avoid pinned-memory allocation failures on systems with constrained host memory.
Why it matters: If you run the SYCL backend on Intel Arc (or AMD GPUs), this targets out-of-memory failures when loading models near VRAM limits. Not a game-changer for NVIDIA users, but a meaningful step for inference on non-NVIDIA hardware.
3. Qwen3.6-35B-A3B: Quiet but Relentless Climb (+10 Likes)
The Qwen3.6-35B-A3B MoE model gained 10 likes (2,137 → 2,147), the strongest Qwen movement today. Its sparse architecture (35B total, 3B active per token) makes it one of the most efficient large models for mid-range GPUs. At Q4 quantization on an RTX 3060 it uses ~8-10GB VRAM, a figure that presumably relies on keeping part of the expert weights in system RAM, since the full 35B weights at 4 bits come to roughly 17.5GB. Reasoning quality approaches that of much larger dense models, and the community is discovering this sweet spot.
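For a rough sense of where VRAM figures like these come from, here is a back-of-envelope sketch of quantized weight size. The function name is hypothetical, and the 4.0 bits per weight is a nominal assumption: real Q4 formats carry extra scale data, and the KV cache and activations add to the total.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    Counts weights only; KV cache, activations, and runtime overhead are excluded.
    """
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Qwen3.6-35B-A3B: 35B total parameters, 3B active per token (from the text).
full_weights = weight_memory_gb(35, 4.0)    # all experts
active_weights = weight_memory_gb(3, 4.0)   # parameters touched per token

print(f"all weights:    {full_weights:.1f} GB")    # 17.5 GB
print(f"active weights: {active_weights:.1f} GB")  # 1.5 GB
```

Every expert has to live somewhere, so a VRAM footprint below the full-weight estimate implies that part of the model sits in system RAM. The small active set is what keeps that split usable in practice.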
📊 Model Trends
HuggingFace Trending (Top 15)
| Rank | Model | Likes | Δ (24h) | Notes |
|---|---|---|---|---|
| 1 | deepseek-ai/DeepSeek-R1 | 13,394 | +1 | Unshaken |
| 2 | black-forest-labs/FLUX.1-dev | 13,221 | +2 | Image gen standard |
| 3 | stabilityai/stable-diffusion-xl-base-1.0 | 7,823 | — | Flat |
| 4 | CompVis/stable-diffusion-v1-4 | 7,021 | — | Legacy holdover |
| 5 | meta-llama/Meta-Llama-3-8B | 6,578 | — | The 8B benchmark |
| 6 | hexgrad/Kokoro-82M | 6,348 | +4 | TTS momentum continues |
| 7 | meta-llama/Llama-3.1-8B-Instruct | 6,098 | +1 | Steady |
| 8 | openai/whisper-large-v3 | 5,826 | — | Flat |
| 9 | black-forest-labs/FLUX.1-schnell | 5,144 | +2 | Steady growth |
| 10 | bigscience/bloom | 5,011 | — | Legacy |
| 11 | stabilityai/stable-diffusion-3-medium | 4,976 | — | |
| 12 | sentence-transformers/all-MiniLM-L6-v2 | 4,959 | — | Embedding workhorse |
| 13 | deepseek-ai/DeepSeek-V4-Pro | 4,906 | +8 | 🔥 Still climbing |
| 14 | openai/gpt-oss-120b | 4,889 | — | Flat, passed up |
| 15 | Tongyi-MAI/Z-Image-Turbo | 4,814 | +2 | New image model |
Signal: The #13↔#14 gap widened from 9 to 17 likes, so V4-Pro is pulling away rather than just edging ahead. Kokoro-82M's consistent +4/day trajectory hints that TTS models may be becoming a growth category on HF.
Qwen Model Lineup
| Model | Likes | Δ | VRAM Fit |
|---|---|---|---|
| Qwen/QwQ-32B | 2,932 | — | RTX 3090 @ Q4 (~19GB) |
| Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 2,882 | +1 | RTX 3090 @ Q4 (~17GB) |
| Qwen/Qwen-Image | 2,511 | — | Multi-modal |
| Qwen/Qwen-Image-Edit | 2,425 | — | Multi-modal |
| Qwen/Qwen3.6-35B-A3B | 2,147 | +10 | RTX 3060 @ Q4 (~10GB MoE) ✅ |
| Qwen/Qwen2.5-Coder-32B-Instruct | 2,043 | — | RTX 3090 @ Q4 (~19GB) |
| Qwen/Qwen2.5-Omni-7B | 1,910 | — | RTX 3060 @ Q4 (~5GB) ✅ |
Google Gemma Ecosystem
| Model | Likes | Δ | VRAM Fit |
|---|---|---|---|
| google/gemma-7b | 3,358 | — | RTX 3060 @ Q4 (~5GB) ✅ |
| google/gemma-4-31B-it | 3,007 | +3 | RTX 3090 @ Q4 (~19GB) |
| google/gemma-3-27b-it | 1,980 | — | RTX 3090 @ Q4 (~17GB) |
| dealignai/Gemma-4-31B-JANG_4M-CRACK | 1,645 | — | Community fine-tune |
| google/gemma-3n-E4B-it-litert-preview | 1,485 | — | RTX 3060 @ Q4 (~3GB) ✅ |
New today: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF, a coder-specialized Gemma 4 community fine-tune published June 16, now at 1,289 likes. It ships in GGUF format, ready for local inference.
⚙️ Engine Updates
llama.cpp — b9673 (New Today)
| Build | Date | Key Changes |
|---|---|---|
| b9673 | 2026-06-17 | SYCL: USM system allocations for large GPU buffers ≥1GB |
| b9672 | 2026-06-16 | BoringSSL 0.20260616.0 update |
| b9670 | 2026-06-16 | NVFP4 edge-case fixes in llama-graph |
Analysis: Only 1 build today (vs. 5 yesterday), so a quieter day. The SYCL USM addition is narrowly targeted but meaningful for non-NVIDIA hardware. If you use the SYCL backend (Intel Arc or AMD), upgrade.
Ollama — v0.30.9 (No change)
Still on v0.30.9 from June 15. Cohere2Moe support, LFM2 parser fixes, and the Hermes Desktop integration from v0.30.7 remain the latest features.
vLLM — v0.23.0 (No change)
v0.23.0 from June 15 remains current. Its headline features are DeepSeek-V4 hardening, MRv2 multi-token prediction, a Rust frontend, Gemma 4 Unified support, and multi-tier KV caching. MiniMax M3 is still available only via recipe.
SGLang — v0.5.13 (No change)
v0.5.13 from June 13 is still latest. Nemotron 3 Ultra autoregressive support was the headline addition.
📰 AI News (Hacker News)
| Score | Story | Link |
|---|---|---|
| 208 | GPT‑NL: a sovereign language model for the Netherlands | HN |
Analysis: Only one AI story cleared the HN fetch threshold today. GPT-NL, at 208 points, continues the trend of sovereign/national AI models seen in other regional efforts. The underlying signal: governments are investing in language-specific models to reduce dependence on US-centric AI. Expect this trend to accelerate.
🔄 What Changed Since Yesterday
| Area | Yesterday | Today | Delta |
|---|---|---|---|
| DeepSeek-V4-Pro | 4,898 likes | 4,906 likes | +8 |
| GPT-oss-120b | 4,889 likes | 4,889 likes | 0 (flat) |
| Qwen3.6-35B-A3B | 2,137 likes | 2,147 likes | +10 |
| Kokoro-82M (TTS) | 6,344 likes | 6,348 likes | +4 |
| Gemma-4-31B-it | 3,004 likes | 3,007 likes | +3 |
| FLUX.1-schnell | 5,142 likes | 5,144 likes | +2 |
| llama.cpp | b9672 | b9673 | +1 build, SYCL USM |
| Ollama | v0.30.9 | v0.30.9 | No change |
| vLLM | v0.23.0 | v0.23.0 | No change |
| SGLang | v0.5.13 | v0.5.13 | No change |
Summary: A steadier day than yesterday's llama.cpp sprint. The defining narrative is DeepSeek-V4-Pro's continued separation from GPT-oss-120b (gap widened from 9 to 17 likes) and Qwen3.6-35B-A3B's quiet +10, as one of the most efficient MoE models for consumer GPUs keeps gaining.
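The deltas and the #13↔#14 gap in the table above can be reproduced from two like-count snapshots. The sketch below hard-codes three rows from the table; the function name `like_deltas` and the dictionary layout are illustrative assumptions, not the format the brief's generator uses.

```python
from typing import Dict


def like_deltas(yesterday: Dict[str, int], today: Dict[str, int]) -> Dict[str, int]:
    """Per-model change in likes for models present in both snapshots."""
    return {name: today[name] - yesterday[name]
            for name in today if name in yesterday}


yesterday = {"DeepSeek-V4-Pro": 4898, "GPT-oss-120b": 4889, "Qwen3.6-35B-A3B": 2137}
today = {"DeepSeek-V4-Pro": 4906, "GPT-oss-120b": 4889, "Qwen3.6-35B-A3B": 2147}

deltas = like_deltas(yesterday, today)
print(deltas)  # {'DeepSeek-V4-Pro': 8, 'GPT-oss-120b': 0, 'Qwen3.6-35B-A3B': 10}

# Gap between V4-Pro and GPT-oss-120b on each day.
gap_yesterday = yesterday["DeepSeek-V4-Pro"] - yesterday["GPT-oss-120b"]
gap_today = today["DeepSeek-V4-Pro"] - today["GPT-oss-120b"]
print(gap_yesterday, gap_today)  # 9 17
```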
💡 Local Inference Recommendations
RTX 3060 (12GB VRAM) — Best Options Today:
- Qwen3.6-35B-A3B (MoE, ~10GB @ Q4) — Best reasoning/coding for the price
- Gemma-4-26B-A4B-it (MoE, ~7GB @ Q4) — Great instruction-following with room for context
- Qwen2.5-Omni-7B (~5GB @ Q4) — Multi-modal option with video/audio support
- Gemma-3n-E4B-it (~3GB @ Q4) — Ultra-lightweight for constrained setups
RTX 3090/4090 (24GB VRAM) — Best Options Today:
- DeepSeek-V4-Pro — Full vLLM + llama.cpp support, best reasoning
- QwQ-32B (~19GB @ Q4) — Strong reasoning model
- Gemma-4-31B-it (~19GB @ Q4) — Best instruction-following in the 30B class
- Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (~17GB @ Q4) — Reasoning distilled from Claude 4.6 Opus, per the model name, into a local model
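The recommendations above amount to filtering models by their stated Q4 footprint against a VRAM budget. A minimal sketch follows, using the approximate sizes quoted in this brief. The helper name and the 1.5GB headroom reserved for KV cache and runtime overhead are assumptions, and DeepSeek-V4-Pro is left out because no footprint is given for it.

```python
from typing import Dict, List


def models_that_fit(sizes_gb: Dict[str, float], vram_gb: float,
                    headroom_gb: float = 1.5) -> List[str]:
    """Models whose stated footprint fits in VRAM after reserving headroom."""
    budget = vram_gb - headroom_gb
    return [name for name, size in sizes_gb.items() if size <= budget]


# Approximate Q4 footprints as quoted in the recommendations.
q4_sizes_gb = {
    "Qwen3.6-35B-A3B": 10,
    "Gemma-4-26B-A4B-it": 7,
    "Qwen2.5-Omni-7B": 5,
    "Gemma-3n-E4B-it": 3,
    "QwQ-32B": 19,
    "Gemma-4-31B-it": 19,
    "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled": 17,
}

print(models_that_fit(q4_sizes_gb, vram_gb=12))  # the four RTX 3060 picks
print(len(models_that_fit(q4_sizes_gb, vram_gb=24)))  # 7
```

Raising `headroom_gb` is a simple way to account for longer context windows, which grow the KV cache and shrink the room left for weights.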
Model Intelligence brief generated 2026-06-17 by Hermes Agent.
Sources: HuggingFace API, llama.cpp releases, Ollama releases, vLLM releases, SGLang releases, Hacker News