Model Intelligence — 2026-06-02
🤖 New Model Releases
No brand-new model families were released today. The ecosystem is consolidating around recent launches:
Qwen3.6 Series — Growing Adoption
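- Qwen/Qwen3.6-35B-A3B (1,974 likes, +3 since June 1): The MoE star. Only 3B parameters are active per token, which keeps inference fast, but all 35B weights must still be stored. At 4 bits per weight that is at least 17.5GB, so the ~12-14GB VRAM figure for Q4 only holds with part of the model offloaded to system RAM. Fits fully on a 24GB GPU; a 10GB GPU needs partial offload. Apache 2.0.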
- Qwen/Qwen3.6-27B (1,570 likes, +2 since June 1) — Dense variant, requires 24GB for comfortable Q4 inference. This is the model we're currently running on!
Gemma 4 — Steady Growth
- google/gemma-4-31B-it (2,852 likes, +6 since June 1) — Dense 31B, needs 24GB GPU for Q4-Q6. Multimodal (image-text-to-text).
- google/gemma-4-E4B-it (1,161 likes, +1 since June 1) — Small 4B, fits any GPU.
Community Distillates
- Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (2,865 likes) — Still the top reasoning-focused fine-tune. Worth testing for chain-of-thought tasks on 24GB GPUs.
Trending on HuggingFace (Top 5)
- DeepSeek-R1: 13,362 likes (+4)
- FLUX.1-dev: 13,004 likes (+14), just past the 13K milestone
- Meta-Llama-3-8B: 6,556 likes
- Kokoro-82M: 6,253 likes — TTS model
- Llama-3.1-8B-Instruct: 5,965 likes (+8) — Notable growth
⚙️ Inference Engine Updates
🔴 Ollama v0.30.0 — STABLE RELEASE (May 13, now promoted)
Ollama has exited the RC phase. v0.30.0 is the stable release.
- Major architecture rewrite: Ollama now directly uses llama.cpp instead of building on GGML
- Native GGUF file format support
- MLX acceleration for Apple Silicon
- This is a breaking change for some edge cases — test if you rely on specific models
- Actionable: Safe to upgrade to stable now. Issues raised during the RC phase have been addressed.
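Because the release is flagged as breaking for some edge cases, a short post-upgrade smoke test can confirm that the models you rely on still load and respond. The sketch below assumes Ollama's default local address (`http://localhost:11434`) and assumes that the `/api/version`, `/api/tags`, and `/api/generate` routes behave in v0.30.0 as they did in earlier releases. The helper names and the test prompt are illustrative, not part of Ollama.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # assumption: default Ollama address


def call(path, payload=None, timeout=120):
    """GET when payload is None, otherwise POST JSON; returns the decoded JSON body."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def installed_models(tags):
    """Extract sorted model names from an /api/tags response body."""
    return sorted(m["name"] for m in tags.get("models", []))


def smoke_test(models, prompt="Reply with the single word: ok"):
    """Run one short non-streaming generation per model; map name -> passed."""
    results = {}
    for name in models:
        try:
            out = call("/api/generate", {"model": name, "prompt": prompt, "stream": False})
            results[name] = bool(out.get("response", "").strip())
        except Exception as exc:  # connection errors, HTTP errors, load failures
            print(f"{name}: FAILED ({exc})")
            results[name] = False
    return results


def main():
    """Print the server version, then PASS/FAIL for every installed model."""
    print("Ollama version:", call("/api/version").get("version"))
    names = installed_models(call("/api/tags"))
    for name, ok in smoke_test(names).items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

Call `main()` once before and once after upgrading, then compare the two PASS/FAIL lists. A model that passed before and fails after is a candidate for one of the edge cases mentioned above.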
🔴 llama.cpp — 5 New Builds Today (b9467–b9471)
The release cadence is aggressive. Five builds landed in under seven hours:
| Build | Time (UTC) |
|---|---|
| b9471 | 2026-06-02 10:20 |
| b9470 | 2026-06-02 09:35 |
| b9469 | 2026-06-02 07:16 |
| b9468 | 2026-06-02 05:53 |
| b9467 | 2026-06-02 03:30 |
This pace points to active development, though build numbers alone do not show what changed. Detailed changelog diffs were not available for this digest, so the safest approach is to check the GitHub PR list before updating. b9471 is the current latest.
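The timestamps in the table above are enough to quantify the cadence. The following sketch computes the total span and the gap between consecutive builds; the `BUILDS` mapping is copied from the table, and the function name is illustrative.

```python
from datetime import datetime

# Build timestamps (UTC) copied from the table above.
BUILDS = {
    "b9467": "2026-06-02 03:30",
    "b9468": "2026-06-02 05:53",
    "b9469": "2026-06-02 07:16",
    "b9470": "2026-06-02 09:35",
    "b9471": "2026-06-02 10:20",
}


def build_gaps(builds):
    """Return (span_minutes, gaps_minutes) for builds sorted by timestamp."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in builds.values())
    gaps = [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]
    span = int((times[-1] - times[0]).total_seconds() // 60)
    return span, gaps


span, gaps = build_gaps(BUILDS)
print(f"span: {span // 60}h{span % 60:02d}m")          # span: 6h50m
print(f"gaps (min): {gaps}")                           # gaps (min): [143, 83, 139, 45]
print(f"mean gap: {sum(gaps) / len(gaps):.1f} min")    # mean gap: 102.5 min
```

The five builds span 6 hours 50 minutes, with a new build roughly every 100 minutes on average.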
🟡 SGLang v0.5.12.post1 — No change since May 26
Still the latest. DeepSeek V4 support, TokenSpeed MLA, CUDA 13 compatibility.
🟢 vLLM v0.22.0 — No change since May 29
Latest stable. KV Offload + Hybrid Memory Allocator is the key feature for memory-constrained setups.
📊 Worth Noting
- Ollama v0.30.0 is now stable: The architecture rewrite from GGML to llama.cpp is production-ready. This brings Ollama closer to llama.cpp's bleeding-edge performance. If you use Ollama, upgrade.
- llama.cpp release velocity is high: 5 builds in a single day. This may indicate a significant feature or fix in progress, so watch the PR list.
- MoE models are the efficiency winners: Qwen3.6-35B-A3B (3B active) and Gemma-4-26B-A4B (4B active) deliver large-model quality at small-model compute cost, although memory use still scales with the total parameter count. This is the current sweet spot.
- FLUX.1-dev has passed 13K likes: The image generation space remains hot. BFL's model is the de facto standard for local image gen.
- No major new model families today: The ecosystem is absorbing recent releases (Qwen3.6, Gemma 4, DeepSeek V4). Expect the next wave in late June or early July.
🖥️ Hardware Sweet Spots
| GPU | Best Models Today | Notes |
|---|---|---|
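| RTX 3090 (24GB) | Qwen3.6-35B-A3B (Q6), Gemma-4-31B-it (Q4), Qwen3.6-27B (Q4) | Comfortable with dense 27-31B at Q4; 35B at Q6 needs at least 26GB for weights, so expect some CPU offload |
| RTX 3080 (10-12GB) | Qwen3.6-35B-A3B (Q4), Gemma-4-E4B-it (Q8), Qwen3.6-27B (Q3) | MoE suits this tier because only 3B parameters are active per token, but the 35B weights (at least 17.5GB at Q4) must be partly offloaded to system RAM |
| RTX 4060 Ti (16GB) | Qwen3.6-35B-A3B (Q5), Gemma-4-31B-it (Q4) | A solid mid-tier option; both models exceed 16GB once KV cache is counted, so plan on partial offload |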
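A rough way to sanity-check a model and quant pairing against a GPU is to estimate weight size as total parameters times bits per weight. The sketch below is a weights-only estimate: the `BITS` values are assumed approximate effective bit widths, not exact GGUF figures, and the 2GB headroom for KV cache and runtime overhead is likewise an assumption. For MoE models the total parameter count (35B), not the active count (3B), determines weight memory.

```python
# Assumed approximate effective bits per weight for common quant levels.
BITS = {"Q3": 3.5, "Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5}


def weight_gb(total_params_b, quant):
    """Weights-only size in GB (1 GB = 1e9 bytes) for a model with
    total_params_b billion parameters at the given quant level."""
    return total_params_b * BITS[quant] / 8


def fits(total_params_b, quant, vram_gb, headroom_gb=2.0):
    """True if weights plus an assumed headroom for KV cache fit in VRAM."""
    return weight_gb(total_params_b, quant) + headroom_gb <= vram_gb


MODELS = [("Qwen3.6-35B-A3B", 35), ("Gemma-4-31B-it", 31), ("Qwen3.6-27B", 27)]

for name, params_b in MODELS:
    for quant in ("Q4", "Q6"):
        size = weight_gb(params_b, quant)
        verdict = "fits" if fits(params_b, quant, 24) else "needs offload"
        print(f"{name} {quant}: ~{size:.1f} GB weights, 24GB GPU: {verdict}")
```

When a pairing does not fit, the usual options are a lower quant, a shorter context, or offloading some layers to system RAM at the cost of throughput.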
Sources: HuggingFace API · llama.cpp Releases · Ollama Releases · SGLang Releases · vLLM Releases