Model Intelligence — 2026-05-31
🤖 Model Landscape
Qwen3.6 family continues growing on HuggingFace:
- Qwen/Qwen3.6-35B-A3B: 1,960 likes (+24 since last scan). MoE architecture with only 3B active parameters per token. All 35B parameters still have to be stored, so the full Q4 weights come to roughly 20GB; the ~8GB VRAM figure for Q4 inference is only plausible with most expert weights offloaded to system RAM. The draw is per-token compute close to a 3B model with 35B-scale quality, which keeps it usable on consumer GPUs.
- Qwen/Qwen3.6-27B: 1,552 likes (+42). Dense model, strong reasoning, 32K context. Needs ~17GB VRAM at Q4.
- Community GGUF quantizations by Unsloth and HauhauCS are available for both; uncensored variants are gaining traction.
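The VRAM figures above can be sanity-checked with a back-of-envelope estimate: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.5 bits per weight for a Q4-style quantization (an assumption; the real average depends on the quant type) and ignores KV cache and runtime overhead. The function name `q4_weight_gb` is hypothetical.

```python
def q4_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for a quantized model.

    bits_per_weight=4.5 is an assumed average for Q4-style quants.
    KV cache and runtime overhead are NOT included.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Dense 27B model: every parameter is used for every token.
dense_27b = q4_weight_gb(27)

# MoE 35B model: all experts must be stored somewhere (VRAM or system RAM),
# even though only about 3B parameters are active per token.
moe_35b_total = q4_weight_gb(35)
moe_35b_active = q4_weight_gb(3)

print(f"27B dense weights:    ~{dense_27b:.1f} GB")
print(f"35B MoE, all experts: ~{moe_35b_total:.1f} GB")
print(f"35B MoE, active only: ~{moe_35b_active:.1f} GB")
```

Under these assumptions the 27B estimate (about 15GB of weights) is consistent with the ~17GB figure once cache and overhead are added, while a VRAM footprint far below the all-experts estimate for the MoE model implies that part of the weights live outside the GPU.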
Notable on HuggingFace trending:
- OpenAI/gpt-oss-120b: 4,830 likes. Open-weights model from OpenAI. At this scale it needs an 80GB-class GPU or a multi-GPU setup.
- DeepSeek-R1: 13,355 likes (most-liked LLM on HF). Still the reference point for reasoning performance (see benchmarks).
- Tongyi-MAI/Z-Image-Turbo: 4,718 likes. Fast image generation model, diffusers compatible.
Gemma 4 family (updated counts, Google source):
- gemma-4-31B-it — 2,833 likes (+22 today). Best fit for 24GB GPUs at Q4 (~19GB VRAM).
- gemma-4-E4B-it — 1,155 likes (+28 today). Lightweight option for edge/low-resource deployment.
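The "+N since last scan" figures in the lists above are differences between two snapshots of like counts. A minimal sketch of that bookkeeping follows; the previous-scan numbers are back-computed from the deltas reported here, and the helper names are hypothetical. The optional `fetch_likes` helper assumes the `huggingface_hub` package is installed and is not called in the example, since it needs network access.

```python
def like_deltas(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return likes gained since the previous scan, for models present in both."""
    return {
        repo: current[repo] - previous[repo]
        for repo in current
        if repo in previous
    }


def fetch_likes(repo_id: str) -> int:
    """Fetch the current like count from the Hub (needs network access)."""
    from huggingface_hub import HfApi  # lazy import: optional dependency

    return HfApi().model_info(repo_id).likes


# Previous counts are back-computed from the deltas reported in this digest.
previous_scan = {"Qwen/Qwen3.6-35B-A3B": 1936, "Qwen/Qwen3.6-27B": 1510}
current_scan = {"Qwen/Qwen3.6-35B-A3B": 1960, "Qwen/Qwen3.6-27B": 1552}

deltas = like_deltas(previous_scan, current_scan)
for repo, gained in deltas.items():
    print(f"{repo}: {current_scan[repo]:,} likes (+{gained})")
```

Models that appear only in the current scan are skipped rather than reported with a misleading delta.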
⚙️ Inference Engine Updates
vLLM v0.22.0 (released May 29, new since last scan):
- Released two days before this scan. Builds on v0.21.0's breaking changes (transformers v4 deprecated, C++20 required).
- Expected, but not yet confirmed against the release notes: continued KV cache optimization and PagedAttention refinements (full release notes).
SGLang v0.5.12.post1 (May 26, release notes):
- Stability patch on top of v0.5.12 — 12 fixes for DeepSeek V4 support.
- v0.5.12 added: full DeepSeek V4 support (TP/EP/CP/DPA), HiSparse KV offloading, B300/MI35X hardware support.
llama.cpp: 4 builds in one day (May 31, releases):
- b9444 (21:51 UTC), b9442 (11:07 UTC), b9441 (09:49 UTC), b9439 (06:57 UTC): four tagged builds on the scan date.
- An earlier build (b9388) added MMVQ Turing + AMD MFMA optimizations.
- llama.cpp typically tags a numbered build for each change merged to master, so several builds in one day reflects steady commit activity rather than a hotfix cycle or an upcoming milestone. The project uses bNNNN build tags, not v4.x-style version numbers.
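The per-day build count above can be derived from release metadata. The sketch below uses hand-written sample records shaped like items from the GitHub "list releases" API, reading only the `tag_name` and `published_at` fields; the timestamps are the ones listed above, with seconds assumed to be `:00`. The function name `builds_per_day` is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Sample records; only tag_name and published_at are used.
# Seconds are an assumption, the digest lists times to the minute.
releases = [
    {"tag_name": "b9444", "published_at": "2026-05-31T21:51:00Z"},
    {"tag_name": "b9442", "published_at": "2026-05-31T11:07:00Z"},
    {"tag_name": "b9441", "published_at": "2026-05-31T09:49:00Z"},
    {"tag_name": "b9439", "published_at": "2026-05-31T06:57:00Z"},
]


def builds_per_day(records: list[dict]) -> dict[str, int]:
    """Count releases per UTC calendar day, keyed by ISO date string."""
    counts: Counter = Counter()
    for rec in records:
        ts = datetime.strptime(rec["published_at"], "%Y-%m-%dT%H:%M:%SZ")
        counts[ts.date().isoformat()] += 1
    return dict(counts)


for day, count in builds_per_day(releases).items():
    print(f"{day}: {count} builds")
```

Running the same count over a longer window of release records is one way to check whether a given day is unusual before reading anything into it.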
Ollama v0.30.0-rc31 (May 13, releases):
- Major re-architecture: direct llama.cpp integration (no more custom backend), MLX support for Apple Silicon.
- Stable channel: v0.24.0, with Codex App support.
📊 Worth Noting
- vLLM is releasing quickly: v0.20.2 (May 10), v0.21.0 (May 15), v0.22.0 (May 29). Three releases in 19 days, with gaps of 5 and 14 days, indicate very active development. vLLM GitHub
- llama.cpp shipped 4 builds today. Because the project typically tags a build for each change merged to master, this shows continued active development but is not by itself evidence of a pending milestone release. llama.cpp releases
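- Qwen3.6-35B-A3B (HF) remains the best bang-for-VRAM model: only about 3B parameters are active per token, so per-token compute is closer to a 3B model than a 35B one. All expert weights still have to be held in memory, so on the 12-16GB GPU segment (RTX 3060, 4060 Ti 16GB) expect part of the experts to sit in system RAM.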
- OpenAI gpt-oss-120b (HF) at 4,830 likes shows strong community interest in open-weight models from established players. It needs an 80GB-class GPU or multiple consumer GPUs, but is worth tracking.
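The vLLM cadence noted above can be checked directly from the three release dates. This is a small sketch; the year 2026 is taken from the digest date, and `release_gaps_days` is a hypothetical helper name.

```python
from datetime import date

# Release dates as listed in this digest (year assumed from the digest date).
vllm_releases = {
    "v0.20.2": date(2026, 5, 10),
    "v0.21.0": date(2026, 5, 15),
    "v0.22.0": date(2026, 5, 29),
}


def release_gaps_days(releases: dict[str, date]) -> list[tuple[str, int]]:
    """Return (version, days since the previous release) in date order."""
    ordered = sorted(releases.items(), key=lambda item: item[1])
    return [
        (later_name, (later_date - earlier_date).days)
        for (_, earlier_date), (later_name, later_date) in zip(ordered, ordered[1:])
    ]


gaps = release_gaps_days(vllm_releases)
mean_gap = sum(days for _, days in gaps) / len(gaps)

for version, days in gaps:
    print(f"{version}: {days} days after the previous release")
print(f"mean gap: {mean_gap:.1f} days")
```

With only three data points the mean is a rough indicator at best; a longer release history would be needed to say whether the cadence is actually speeding up.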
🖥️ Hardware Sweet Spots
| GPU | Best Model | Notes |
|---|---|---|
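| RTX 3060 12GB | Qwen3.6-35B-A3B | MoE efficiency shines here; expect partial expert offload to system RAM |
| RTX 3090/4090 24GB | Qwen3.6-27B or gemma-4-31B-it | Full Q4 fit, good performance |
| Dual 24GB | gpt-oss-120b (partial CPU offload) | Weights exceed 48GB; full DeepSeek-R1 (671B parameters) needs several hundred GB even at Q4 and does not fit |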
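The table can be read as a simple fit check: a model is a candidate when its approximate Q4 VRAM figure, plus some headroom, is within the card's memory. The sketch below uses the Q4 figures quoted in this digest; the 2GB headroom for KV cache and runtime overhead is an assumption, and the names `ModelFit` and `models_that_fit` are hypothetical.

```python
from typing import NamedTuple


class ModelFit(NamedTuple):
    name: str
    q4_vram_gb: float  # approximate Q4 VRAM figure quoted in this digest


CANDIDATES = [
    ModelFit("Qwen3.6-35B-A3B", 8.0),
    ModelFit("Qwen3.6-27B", 17.0),
    ModelFit("gemma-4-31B-it", 19.0),
]


def models_that_fit(vram_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Return candidate names whose Q4 figure fits within vram_gb minus headroom.

    headroom_gb is an assumed allowance for KV cache and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    return [m.name for m in CANDIDATES if m.q4_vram_gb <= budget]


for vram in (12, 24):
    print(f"{vram}GB card: {models_that_fit(vram)}")
```

Longer contexts need more KV cache, so raising `headroom_gb` is a reasonable adjustment when planning for long prompts.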
Data sourced from HuggingFace API, vLLM GitHub, SGLang GitHub, llama.cpp GitHub, Ollama GitHub