Model Intelligence — 2026-05-28

🤖 New Model Releases

Trending on HuggingFace (top picks for 10–24GB VRAM):

| Model | Params | VRAM (Q4) | Likes | Notes |
|-------|--------|-----------|-------|-------|
| Qwen/Qwen3.6-27B | 27B | ~17GB | 1,510 | Strong reasoning, 32K context, full GGUF support |
| Qwen/Qwen3.6-35B-A3B | 35B (MoE, 3B active) | ~8GB active | 1,936 | MoE architecture: efficient inference, only 3B params active per token |
| Qwen/Qwen3.5-397B-A17B | 397B (MoE, 17B active) | ~10GB active | 1,493 | Massive model, low active params; needs multi-GPU for full weights |
| google/gemma-4-E4B-it | 4B | ~3GB | 1,127 | Lightweight, fast inference on any GPU |
| google/gemma-4-31B-it | 31B | ~19GB | 2,811 | Most-liked Gemma 4, fits 24GB at Q4 |
| deepseek-ai/DeepSeek-V4-Pro | MoE | ~20GB+ | 4,405 | Requires SGLang v0.5.12+ or vLLM for full support |

Key model news:

⚙️ Inference Engine Updates

SGLang v0.5.12 (May 16) + v0.5.12.post1 (May 26):

vLLM v0.21.0 (May 15):

Ollama v0.30.0-rc29 (May 13):

Ollama v0.24.0 (May 14):

llama.cpp b9388 (May 29):

📊 Worth Noting

MoE models are the efficiency play for 2026: Qwen3.6-35B-A3B (35B total, 3B active) and Qwen3.5-397B-A17B (397B total, 17B active) suggest that sparse MoE architectures are becoming practical for consumer hardware. The 35B-A3B is listed at ~8GB of VRAM at Q4 for its active footprint, not for the full 35B weights, which still have to be stored somewhere (for example in system RAM). Its quality approaches that of its dense 27B sibling.
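To make the total-versus-active distinction concrete, here is a rough back-of-the-envelope sketch. It assumes about 4.5 bits per weight, which is what a Q4_0-style layout works out to (32 four-bit weights plus one fp16 scale per block); other Q4 variants differ somewhat. It counts weights only and ignores KV cache, activations, and runtime overhead, so real usage will be higher. The function names are illustrative, not taken from any library.

```python
def weight_memory_gb(params_billion, bits_per_weight=4.5):
    """Approximate memory for the weights alone, in decimal GB.

    bits_per_weight=4.5 is an assumption (Q4_0-style: 4-bit weights plus
    a per-block fp16 scale). KV cache and activations are not included.
    """
    return params_billion * bits_per_weight / 8


def moe_weight_estimate(total_billion, active_billion, bits_per_weight=4.5):
    """Compare storage for all experts with the weights touched per token."""
    return {
        "total_gb": weight_memory_gb(total_billion, bits_per_weight),
        "active_gb": weight_memory_gb(active_billion, bits_per_weight),
    }


if __name__ == "__main__":
    # Parameter counts are the ones quoted in this digest.
    for name, total, active in [
        ("Qwen3.6-35B-A3B", 35, 3),
        ("Qwen3.5-397B-A17B", 397, 17),
    ]:
        est = moe_weight_estimate(total, active)
        print(f"{name}: ~{est['total_gb']:.1f} GB total, "
              f"~{est['active_gb']:.1f} GB active per token")
```

Under these assumptions the 35B-A3B needs roughly 19.7 GB to hold every expert, while only about 1.7 GB of weights are read per token. That gap is why MoE models tend to pair well with setups that keep the full weights in slower memory and only need fast access to the active subset.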

DeepSeek V4 ecosystem maturing: Both SGLang and vLLM now support DeepSeek V4, with vLLM v0.20.2 fixing sparse attention issues and SGLang v0.5.12 adding full parallelism support across Nvidia's latest hardware and AMD MI35X.

Ollama's re-architecture: The shift from GGML to direct llama.cpp integration (v0.30) suggests a cleaner separation of concerns. GGML becomes the file format layer, while llama.cpp handles the actual inference. This should improve compatibility and reduce maintenance burden.

Build toolchain changes: vLLM's C++20 requirement and transformers v5 migration signal that the inference stack is modernizing. If your build environment is stuck on C++17 or transformers 4.x, update before upgrading to vLLM 0.21.
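A quick pre-upgrade check for the Python side of this could look like the sketch below. It reads the installed `transformers` version through the standard library's `importlib.metadata` and compares it to a minimum major version. The helper names are hypothetical, the `"5.0"` threshold is taken from the "transformers v5" wording above rather than from vLLM's actual requirements file, and pre-release suffixes such as `rc1` are simply ignored. The C++20 compiler check is not covered here.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip().lstrip("v"))
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if `installed` >= `minimum`, comparing numeric components only."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so "5" and "5.0.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    elif meets_minimum(found, "5.0"):  # assumed threshold, see lead-in
        print(f"transformers {found}: looks ready for the v5 migration")
    else:
        print(f"transformers {found}: still on 4.x or older, update first")
```

For anything beyond a quick sanity check, a full version parser such as the one in the third-party `packaging` library handles pre-releases and local versions more carefully than this numeric comparison.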


Data sourced from HuggingFace API, GitHub release feeds, and automated scanning. Inference engines checked: llama.cpp b9388, Ollama v0.30.0-rc29, SGLang v0.5.12.post1, vLLM v0.21.0.
