Model Intelligence — 2026-06-03
🤖 New Model Releases
No new model families today, but existing releases are showing accelerated adoption:
Qwen3.6 Series — Momentum Building
- Qwen/Qwen3.6-35B-A3B (1,982 likes, +8 since yesterday) — The MoE star continues climbing. Only 3B parameters are active per token, which keeps compute low, but the estimated Q4 footprint is ~12-14GB VRAM, so a 10GB card needs partial offload to system RAM. The Apache 2.0 license permits commercial use.
- Qwen/Qwen3.6-27B (1,580 likes, +10 since yesterday) — Dense variant gaining fast. Needs a 24GB GPU for comfortable Q4 inference.
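The VRAM figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below is a rough rule of thumb, not a measurement. The 4.5 bits per weight for a Q4-class quantization and the 2GB allowance for KV cache and runtime buffers are assumptions, and the function names are hypothetical.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    return estimate_weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# Dense Qwen3.6-27B at an assumed ~4.5 bits per weight (Q4-class).
print(round(estimate_weight_gb(27, 4.5), 1))  # 15.2
print(fits(27, 4.5, vram_gb=24))              # True
print(fits(27, 4.5, vram_gb=16))              # False
```

For MoE models, the total parameter count (not the active count) determines weight memory, since every expert has to be stored somewhere even if only a few run per token.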
Gemma 4 — Small Model Surging
- google/gemma-4-31B-it (2,861 likes, +9) — Dense 31B, multimodal. Needs 24GB for Q4-Q6.
- google/gemma-4-E4B-it (1,173 likes, +12) — Notable jump for the 4B model. Fits GPUs with as little as 4-6GB of VRAM, great for edge/IoT.
Community Distillates
- Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (2,866 likes) — Still the top reasoning fine-tune. 24GB GPU territory.
Trending on HuggingFace (Top 5)
- DeepSeek-R1: 13,364 likes (+2) — Slow but steady
- FLUX.1-dev: 13,012 likes (+8) — Just past 13K
- Meta-Llama-3-8B: 6,557 likes
- Kokoro-82M: 6,256 likes — TTS
- Llama-3.1-8B-Instruct: 5,974 likes (+9) — Notable growth, possibly new GGUF variants
⚙️ Inference Engine Updates
🔴 Ollama v0.30.2 — PATCH RELEASE (Today, June 3)
Two weeks after the v0.30.0 stable rewrite, Ollama drops a patch:
- Post-stable bug fixes after the llama.cpp architecture rewrite
- Likely addresses edge cases from the RC period
- Actionable: If you upgraded to v0.30.0, this is a safe follow-up patch
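A minimal sketch of the upgrade check implied above: compare an installed version string against v0.30.2 numerically rather than as text. The helper names are hypothetical, and the parser assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffix (an `-rc1` tag would need extra handling).

```python
def parse_version(text: str) -> tuple:
    """Turn 'v0.30.2' or '0.30.2' into (0, 30, 2) for numeric comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def needs_patch(installed: str, target: str = "v0.30.2") -> bool:
    """True if the installed version is older than the target patch release."""
    return parse_version(installed) < parse_version(target)


print(needs_patch("v0.30.0"))  # True
print(needs_patch("0.30.2"))   # False
print(needs_patch("v0.9.9"))   # True (a plain string comparison would get this wrong)
```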
🔴 llama.cpp — 11 Builds Today (b9481–b9491)
The extraordinary release cadence continues, with 11 builds in a single day. The five most recent:
| Build | Time (UTC) |
|---|---|
| b9491 | 2026-06-03 14:17 |
| b9490 | 2026-06-03 11:46 |
| b9489 | 2026-06-03 11:22 |
| b9488 | 2026-06-03 07:47 |
| b9487 | 2026-06-03 06:25 |
This is well above normal maintenance velocity. It suggests something major is being iterated on, possibly quantization improvements, Vulkan/Metal backend work, or MoE optimization given the current model landscape. b9491 is the latest build.
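The build count can be checked directly from the tag range, and the listed timestamps can be grouped by UTC date. This is a small sketch using only the rows shown in the table; the function names are hypothetical, and it assumes build tags are consecutive integers prefixed with `b`.

```python
from collections import Counter
from datetime import datetime


def builds_in_range(first_tag: str, last_tag: str) -> int:
    """Count builds in an inclusive tag range such as b9481..b9491."""
    return int(last_tag.lstrip("b")) - int(first_tag.lstrip("b")) + 1


def builds_per_day(releases):
    """Group (tag, 'YYYY-MM-DD HH:MM') pairs by their UTC date."""
    return Counter(
        datetime.strptime(stamp, "%Y-%m-%d %H:%M").date().isoformat()
        for _, stamp in releases
    )


# Only the five builds listed in the table above.
RELEASES = [
    ("b9491", "2026-06-03 14:17"),
    ("b9490", "2026-06-03 11:46"),
    ("b9489", "2026-06-03 11:22"),
    ("b9488", "2026-06-03 07:47"),
    ("b9487", "2026-06-03 06:25"),
]

print(builds_in_range("b9481", "b9491"))  # 11
print(builds_per_day(RELEASES))           # Counter({'2026-06-03': 5})
```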
🟡 SGLang v0.5.12.post1 — No change since May 26
DeepSeek V4 support, TokenSpeed MLA, CUDA 13 compatibility remain the latest features.
🟢 vLLM v0.22.0 — No change since May 29
KV Offload and the Hybrid Memory Allocator remain the headline features. Good for memory-constrained multi-model deployments.
📊 Worth Noting
- Ollama v0.30.2 is a post-rewrite patch — The llama.cpp rewrite is stabilizing. A patch two weeks after the stable release suggests the project is healthy and responsive to feedback.
- llama.cpp at 11 builds in a single day — This is the highest single-day velocity we've tracked. The team appears to be working on something significant. Watch GitHub PRs for clues; it could be MoE-specific optimizations given current model trends.
- MoE adoption is real and growing — Qwen3.6-35B-A3B (+8/day) and Gemma-4-E4B-it (+12/day) both keep gaining. Gemma-4-E4B-it outpaced the dense Gemma-4-31B-it (+9), while the dense Qwen3.6-27B (+10) gained slightly more than Qwen3.6-35B-A3B. The efficiency argument (3-4B active params for 30B+ quality) is resonating.
- Llama-3.1-8B-Instruct growing again (+9/day) — Possibly driven by new GGUF quantization variants or community fine-tunes. Still the go-to for 10GB+ cards running a proven, well-supported model.
- The "consolidation period" continues — No major new model families since early June. This typically means the next wave is building; late June or early July is a reasonable window to expect new releases.
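Raw like deltas favor models that already have large followings, so the adoption comparison above is easier to read as relative growth. The sketch below uses only the like counts and daily gains reported in this issue; the dictionary and function names are hypothetical, and it assumes each reported total already includes today's gain.

```python
# (total likes today, likes gained since yesterday), as reported in this issue.
LIKES = {
    "Qwen/Qwen3.6-35B-A3B": (1982, 8),
    "Qwen/Qwen3.6-27B": (1580, 10),
    "google/gemma-4-31B-it": (2861, 9),
    "google/gemma-4-E4B-it": (1173, 12),
}


def daily_growth_pct(total: int, gained: int) -> float:
    """Percentage growth relative to yesterday's like count."""
    return 100 * gained / (total - gained)


def rank_by_growth(likes):
    """Model names sorted from fastest to slowest relative growth."""
    return sorted(likes, key=lambda name: daily_growth_pct(*likes[name]), reverse=True)


for name in rank_by_growth(LIKES):
    print(f"{name}: {daily_growth_pct(*LIKES[name]):.2f}%")
# google/gemma-4-E4B-it: 1.03%
# Qwen/Qwen3.6-27B: 0.64%
# Qwen/Qwen3.6-35B-A3B: 0.41%
# google/gemma-4-31B-it: 0.32%
```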
🖥️ Hardware Sweet Spots
| GPU | Best Models Today | Notes |
|---|---|---|
| RTX 3090 (24GB) | Qwen3.6-35B-A3B (Q6), Gemma-4-31B-it (Q4), Qwen3.6-27B (Q4) | Still the ideal balance for large models |
| RTX 4060 Ti (16GB) | Qwen3.6-35B-A3B (Q5), Gemma-4-31B-it (Q3-Q4) | Best value mid-tier option |
| RTX 3080 (10-12GB) | Qwen3.6-35B-A3B (Q4), Gemma-4-E4B-it (Q8) | MoE models make small VRAM viable |
| Any GPU (4-6GB) | Gemma-4-E4B-it (Q8), Gemma-2-2B (Q8) | 4B models are genuinely usable everywhere |
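
The table above can be read as a lookup: take the highest tier whose minimum VRAM your card meets. A minimal sketch, with the tiers transcribed from the table. The thresholds use the low end of each VRAM range, which is an assumption, and the names are hypothetical.

```python
# (minimum VRAM in GB, suggested models), ordered from largest tier to smallest.
TIERS = [
    (24, ["Qwen3.6-35B-A3B (Q6)", "Gemma-4-31B-it (Q4)", "Qwen3.6-27B (Q4)"]),
    (16, ["Qwen3.6-35B-A3B (Q5)", "Gemma-4-31B-it (Q3-Q4)"]),
    (10, ["Qwen3.6-35B-A3B (Q4)", "Gemma-4-E4B-it (Q8)"]),
    (4, ["Gemma-4-E4B-it (Q8)", "Gemma-2-2B (Q8)"]),
]


def suggestions(vram_gb: float) -> list:
    """Return the model list for the highest tier the card qualifies for."""
    for min_vram, models in TIERS:
        if vram_gb >= min_vram:
            return models
    return []  # below the smallest tier in the table


print(suggestions(12))  # ['Qwen3.6-35B-A3B (Q4)', 'Gemma-4-E4B-it (Q8)']
print(suggestions(8))   # ['Gemma-4-E4B-it (Q8)', 'Gemma-2-2B (Q8)']
```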
Sources: HuggingFace API · llama.cpp Releases · Ollama Releases · SGLang Releases · vLLM Releases