Model Intelligence — 2026-06-03

🤖 New Model Releases

No new model families today, but existing releases are showing accelerated adoption:

Qwen3.6 Series — Momentum Building

Gemma 4 — Small Model Surging

Community Distillates

Trending on HuggingFace (Top 5)

  1. DeepSeek-R1: 13,364 likes (+2) — Slow but steady
  2. FLUX.1-dev: 13,012 likes (+8) — Just past 13K
  3. Meta-Llama-3-8B: 6,557 likes
  4. Kokoro-82M: 6,256 likes — TTS
  5. Llama-3.1-8B-Instruct: 5,974 likes (+9) — Notable growth, possibly new GGUF variants
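
The bracketed figures above are day-over-day differences between two like-count snapshots. A minimal sketch of that bookkeeping follows. The `previous` values are an assumption: they are back-computed from the reported deltas rather than read from the HuggingFace API, and the function name `like_deltas` is hypothetical.

```python
def like_deltas(previous, current):
    """Return {model: current - previous} for models present in both snapshots."""
    return {name: current[name] - previous[name]
            for name in current if name in previous}

# Current counts are taken from the list above.
current = {
    "DeepSeek-R1": 13364,
    "FLUX.1-dev": 13012,
    "Llama-3.1-8B-Instruct": 5974,
}
# Assumed previous-day counts: current count minus the reported delta.
previous = {
    "DeepSeek-R1": 13362,
    "FLUX.1-dev": 13004,
    "Llama-3.1-8B-Instruct": 5965,
}

deltas = like_deltas(previous, current)
# Print the fastest-growing models first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {current[name]:,} likes ({delta:+d})")
```

Models missing from either snapshot are skipped rather than reported with a misleading delta, which is why entries without a bracketed figure are left out here.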

⚙️ Inference Engine Updates

🔴 Ollama v0.30.2 — PATCH RELEASE (Today, June 3)

Two weeks after the v0.30.0 stable rewrite, Ollama ships a patch release.

🔴 llama.cpp — 11 Builds Today (b9481–b9491)

The extraordinary release cadence continues, now at 11 builds in a single day (b9481 through b9491). The five most recent:

Build | Time (UTC)
b9491 | 2026-06-03 14:17
b9490 | 2026-06-03 11:46
b9489 | 2026-06-03 11:22
b9488 | 2026-06-03 07:47
b9487 | 2026-06-03 06:25

This pace goes beyond routine maintenance and suggests active iteration on something major. Given the current model landscape, likely candidates are quantization improvements, Vulkan/Metal backend work, or MoE optimization. b9491 is the current latest.
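
The cadence figures can be checked with a little arithmetic. The sketch below counts the builds in the tag range and measures the gaps between the five listed builds. The helper names are hypothetical, and it assumes build tags are consecutive integers prefixed with `b`.

```python
from datetime import datetime

def builds_in_range(first_tag, last_tag):
    """Count builds in an inclusive tag range such as b9481..b9491."""
    return int(last_tag.removeprefix("b")) - int(first_tag.removeprefix("b")) + 1

def gaps_in_minutes(timestamps):
    """Minutes between consecutive builds, ordered oldest to newest."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(times, times[1:])]

# The five most recent builds and their UTC timestamps, as listed above.
listed = {
    "b9491": "2026-06-03 14:17",
    "b9490": "2026-06-03 11:46",
    "b9489": "2026-06-03 11:22",
    "b9488": "2026-06-03 07:47",
    "b9487": "2026-06-03 06:25",
}

total = builds_in_range("b9481", "b9491")
gaps = gaps_in_minutes(listed.values())
mean_gap = sum(gaps) / len(gaps)
print(f"{total} builds today; mean gap across the listed builds: {mean_gap:.0f} min")
```

Only the listed timestamps are used, so the mean gap describes the last five builds, not the full day.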

🟡 SGLang v0.5.12.post1 — No change since May 26

DeepSeek V4 support, TokenSpeed MLA, and CUDA 13 compatibility remain the latest features.

🟢 vLLM v0.22.0 — No change since May 29

KV Offload and the Hybrid Memory Allocator remain the headline features. Both are good for memory-constrained multi-model deployments.

📊 Worth Noting

  1. Ollama v0.30.2 is a post-rewrite patch — The llama.cpp rewrite is stabilizing. This is the kind of cadence that suggests the project is healthy and responsive to feedback.

  2. llama.cpp at 11 builds in a single day — This is the highest sustained velocity we've tracked. The team is clearly working on something significant. Watch GitHub PRs for clues — could be MoE-specific optimizations given current model trends. b9491 is the current latest.

  3. MoE adoption is real and growing — Qwen3.6-35B-A3B (+8/day) and Gemma-4-E4B-it (+12/day) are both MoE architectures gaining faster than their dense counterparts. The efficiency argument (3-4B active params for 30B+ quality) is resonating.

  4. Llama-3.1-8B-Instruct growing again (+9/day) — Possibly driven by new GGUF quantization variants or community fine-tunes. Still the go-to for 10GB+ cards running a proven, well-supported model.

  5. The "consolidation period" continues — No major new model families since early June. This typically means the next wave is building. Late June/early July is a reasonable window to expect new releases.

🖥️ Hardware Sweet Spots

GPU | Best Models Today | Notes
RTX 3090 (24GB) | Qwen3.6-35B-A3B (Q6), Gemma-4-31B-it (Q4), Qwen3.6-27B (Q4) | Still the ideal balance for large models
RTX 4060 Ti (16GB) | Qwen3.6-35B-A3B (Q5), Gemma-4-31B-it (Q3-Q4) | Best value mid-tier option
RTX 3080 (10-12GB) | Qwen3.6-35B-A3B (Q4), Gemma-4-E4B-it (Q8) | MoE models make small VRAM viable
Any GPU (4-6GB) | Gemma-4-E4B-it (Q8), Gemma-2-2B (Q8) | 4B models are genuinely usable everywhere
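
A rough way to sanity-check pairings like these is to estimate the quantized weight size as total parameters times bits per weight, divided by 8. The sketch below does that under stated assumptions: the bits-per-weight values are approximate averages for common GGUF quantization levels, total parameter counts are read off the model names, and the flat 1.5 GB allowance for KV cache and buffers is a placeholder, not a measured figure.

```python
# Approximate bits per weight for common GGUF quantization levels.
# Assumed averages for illustration; real files vary by model and quant mix.
BITS_PER_WEIGHT = {"Q3": 3.9, "Q4": 4.85, "Q5": 5.7, "Q6": 6.56, "Q8": 8.5}

def weight_size_gb(params_billion, quant):
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def fits_in_vram(params_billion, quant, vram_gb, overhead_gb=1.5):
    """True if weights plus a flat allowance for KV cache and buffers fit."""
    return weight_size_gb(params_billion, quant) + overhead_gb <= vram_gb

# (model, total params in billions assumed from the name, quant, VRAM in GB)
rows = [
    ("Qwen3.6-35B-A3B", 35, "Q6", 24),
    ("Gemma-4-31B-it", 31, "Q4", 24),
    ("Qwen3.6-27B", 27, "Q4", 24),
    ("Qwen3.6-35B-A3B", 35, "Q5", 16),
    ("Qwen3.6-35B-A3B", 35, "Q4", 12),
]

for name, params, quant, vram in rows:
    size = weight_size_gb(params, quant)
    verdict = "fits in VRAM" if fits_in_vram(params, quant, vram) else "needs partial CPU offload"
    print(f"{name} {quant} on {vram} GB: ~{size:.1f} GB weights, {verdict}")
```

Under these assumptions the 35B MoE model exceeds VRAM on every card listed, so those pairings presumably rely on keeping part of the weights in system RAM. That is tolerable for MoE models because only a few billion parameters are active per token.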

Sources: HuggingFace API · llama.cpp Releases · Ollama Releases · SGLang Releases · vLLM Releases
