AI Model Intelligence — 2026-05-28
🤖 New Model Releases
Trending on HuggingFace (top picks for 10–24GB VRAM):
| Model | Params | VRAM (Q4) | Likes | Notes |
|-------|--------|-----------|-------|-------|
| Qwen/Qwen3.6-27B | 27B | ~17GB | 1,510 | Strong reasoning, 32K context, full GGUF support |
| Qwen/Qwen3.6-35B-A3B | 35B (MoE, 3B active) | ~8GB active | 1,936 | MoE architecture: efficient inference, only 3B params active per token |
| Qwen/Qwen3.5-397B-A17B | 397B (MoE, 17B active) | ~10GB active | 1,493 | Massive model, low active params; needs multi-GPU for full weights |
| google/gemma-4-E4B-it | 4B | ~3GB | 1,127 | Lightweight, fast inference on any GPU |
| google/gemma-4-31B-it | 31B | ~19GB | 2,811 | Most-liked Gemma 4, fits 24GB at Q4 |
| deepseek-ai/DeepSeek-V4-Pro | MoE | ~20GB+ | 4,405 | Requires SGLang v0.5.12+ or vLLM for full support |
Key model news:
- Qwen3.6-35B-A3B: MoE variant with only 3B active parameters per token. The ~8GB figure refers to the active path at Q4; all 35B weights (roughly 20GB at Q4) still have to be held in VRAM or offloaded to system RAM. Per-token compute is far lower than for the dense 27B, with quality approaching it.
- DeepSeek-V4-Pro: Now the #2 most-liked DeepSeek model (4,405 likes). Full inference support just arrived in SGLang v0.5.12.
- Gemma 4 family: Google's latest offering. The 31B-it variant (2,811 likes) is the sweet spot for 24GB GPUs.
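The VRAM column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes an effective 5 bits per weight for Q4 (a value picked because it roughly reproduces the ~17GB and ~19GB dense figures in the table), counts weights alone, and ignores KV cache and runtime overhead. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 5.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    bits_per_weight=5.0 is an assumed effective rate for Q4 quantization,
    not a figure published for any specific GGUF file.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# (name, total params in billions, active params in billions), from the table.
models = [
    ("Qwen3.6-27B", 27, 27),
    ("Qwen3.6-35B-A3B", 35, 3),
    ("gemma-4-31B-it", 31, 31),
]

for name, total_b, active_b in models:
    resident = weight_memory_gb(total_b)    # all weights that must be stored somewhere
    per_token = weight_memory_gb(active_b)  # weights actually read for one token
    print(f"{name}: ~{resident:.1f} GB total weights, ~{per_token:.1f} GB touched per token")
```

Under these assumptions the MoE model's full weights come to roughly 22 GB even though only about 2 GB of them are read per token, so running it in less VRAM than that would depend on keeping part of the experts in system RAM.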
⚙️ Inference Engine Updates
SGLang v0.5.12 (May 16) + v0.5.12.post1 (May 26):
- DeepSeek V4 full support — Parallelism: TP/EP/CP/Data Parallel Attention
- Hardware support: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
- HiSparse — Offloads inactive KV cache to CPU memory, extending context length
- post1 patch — 12 stability fixes, primarily for DeepSeek V4 on B200/B300
vLLM v0.21.0 (May 15):
- Transformers v4 deprecated — Migrate to transformers v5 (367 commits from 202 contributors)
- C++20 build requirement — Breaking build change for PyTorch compatibility
- KV Offloading — Improves context length for constrained VRAM
- v0.20.2 also fixed DeepSeek V4 sparse attention and gpt-oss/Qwen3-VL bugs
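To see why KV cache offloading (vLLM) and HiSparse (SGLang) matter for context length, it helps to estimate how fast the cache grows. The sketch below assumes standard grouped-query attention with an fp16 cache; the layer count, KV head count, and head dimension are hypothetical values chosen for illustration, not the configuration of any model listed here. Architectures with compressed or sparse attention, such as DeepSeek V4, will not follow this formula.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token: keys + values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Total KV cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len / 1e9


def tokens_that_fit(budget_gb: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, bytes_per_elem: int = 2) -> int:
    """How many tokens of context fit in a given KV cache budget."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return int(budget_gb * 1e9 // per_token)


# Hypothetical model shape: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
shape = dict(num_layers=48, num_kv_heads=8, head_dim=128)

print(f"32K context: ~{kv_cache_gb(seq_len=32768, **shape):.2f} GB of KV cache")
print(f"4 GB budget: ~{tokens_that_fit(4, **shape)} tokens")
```

With this assumed shape, a 32K context needs about 6.4 GB on top of the weights, which is the portion that offloading moves to CPU memory.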
Ollama v0.30.0-rc29 (May 13):
- Major architecture change: now integrates llama.cpp directly instead of building on GGML
- MLX acceleration for Apple Silicon model inference
- GGUF file format compatibility maintained
Ollama v0.24.0 (May 14):
- Codex App support: `ollama launch codex-app` for OpenAI's desktop Codex experience
- Parallel worktree support and git functionality
llama.cpp b9388 (May 29):
- MMVQ optimization for Turing GPUs (SM75)
- CUDA batch>=4 quantized matmul routing to MMQ on AMD MFMA hardware
- Daily release cycle continues — current build is b9388
📊 Worth Noting
MoE models are the efficiency play for 2026: Qwen3.6-35B-A3B (35B total, 3B active) and Qwen3.5-397B-A17B (397B total, 17B active) show sparse MoE architectures becoming practical on consumer hardware. The 35B-A3B is listed at ~8GB for its active path at Q4 while delivering quality approaching its dense 27B sibling; the full expert weights still have to be resident in VRAM or offloaded to system RAM, so the gain is mainly in per-token compute and memory bandwidth.
DeepSeek V4 ecosystem maturing: Both SGLang and vLLM now support DeepSeek V4, with vLLM v0.20.2 fixing sparse attention issues and SGLang v0.5.12 adding full parallelism support across Nvidia's latest hardware and AMD MI35X.
Ollama's re-architecture: The shift from building on GGML to direct llama.cpp integration (v0.30) suggests a cleaner separation of concerns. GGUF remains the file format, while llama.cpp (which itself builds on the GGML tensor library) handles the actual inference. This should improve compatibility and reduce maintenance burden.
Build toolchain changes: vLLM's C++20 requirement and transformers v5 migration signal that the inference stack is modernizing. If your build environment is stuck on C++17 or transformers 4.x, update before upgrading to vLLM 0.21.
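A quick pre-upgrade check along these lines might look like the sketch below. It is an illustration, not part of vLLM: the helper names are hypothetical, the compiler probe assumes a GCC- or Clang-style driver that accepts `-std=c++20` and `-fsyntax-only`, and the default compiler name `c++` is an assumption about your environment.

```python
import re
import shutil
import subprocess
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '4.57.1' or 'v5.0.0rc1'."""
    match = re.match(r"\s*v?(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1))


def transformers_ready(min_major: int = 5) -> Optional[bool]:
    """True/False if transformers is installed, None if it is not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
    return major_version(installed) >= min_major


def compiler_accepts_cxx20(compiler: str = "c++") -> Optional[bool]:
    """Try to syntax-check a trivial program with -std=c++20; None if no compiler found."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-std=c++20", "-fsyntax-only", "-x", "c++", "-"],
        input="int main() { return 0; }\n",
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("transformers >= 5:", transformers_ready())
    print("compiler accepts C++20:", compiler_accepts_cxx20())
```

A `None` result means the check could not be performed (package or compiler not found), which is worth resolving before attempting the upgrade.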
Data sourced from HuggingFace API, GitHub release feeds, and automated scanning. Inference engines checked: llama.cpp b9388, Ollama v0.30.0-rc29, SGLang v0.5.12.post1, vLLM v0.21.0.