Model Intelligence — 2026-06-06

🤖 Model Trends

No brand-new model families this cycle, but existing models show strong organic growth:

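Qwen3.6-35B-A3B MoE (35B total parameters, only 3B active per token) continues to be the standout for local GPU deployment, with an excellent performance-to-VRAM ratio. All 35B weights still have to be held in memory (VRAM, or partly offloaded to system RAM); the gain is that each token only runs through about 3B of them, so generation is far faster than on a dense 35B model.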
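The memory-versus-compute split of an MoE model can be illustrated with simple arithmetic. This is a rough sketch under stated assumptions: 4-bit weights, the common rule of thumb of about 2 FLOPs per active parameter per generated token, and a hypothetical dense 35B model used only for comparison. KV cache and runtime overhead are ignored.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    total_params_b: float   # parameters that must be stored (billions)
    active_params_b: float  # parameters used for each token (billions)


def weight_gb(spec: ModelSpec, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes).

    Scales with TOTAL parameters: every expert must be resident.
    """
    return spec.total_params_b * bits_per_weight / 8


def gflops_per_token(spec: ModelSpec) -> float:
    """Rule-of-thumb compute: ~2 FLOPs per ACTIVE parameter per token."""
    return 2 * spec.active_params_b


moe = ModelSpec("Qwen3.6-35B-A3B", total_params_b=35, active_params_b=3)
# Hypothetical dense model of the same total size, for comparison only.
dense = ModelSpec("hypothetical dense 35B", total_params_b=35, active_params_b=35)

for spec in (moe, dense):
    print(f"{spec.name}: {weight_gb(spec, 4):.1f} GB of weights at 4-bit, "
          f"~{gflops_per_token(spec):.0f} GFLOPs per token")
```

Both models need the same ~17.5 GB for 4-bit weights, but the MoE does roughly a tenth of the per-token work, which is where its speed advantage comes from.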

🆕 Google QAT (Quantization-Aware Training) Models

Google released pre-quantized Gemma 3 models trained with QAT. At 4-bit precision they keep much more of the original model's quality than the same weights quantized after training (post-training quantization, PTQ).
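To see what "quantized to 4 bits" does to weights, here is a minimal quantize-then-dequantize ("fake quantization") sketch. The per-tensor symmetric scheme with a single shared scale is an illustrative assumption, not Google's published recipe; real int4 formats typically use per-group scales. PTQ applies this rounding once after training. QAT generally applies the same round trip in the forward pass during training (with gradients passed straight through the rounding step), so the weights can adapt to the rounding error.

```python
def quantize_symmetric(weights, bits=4):
    """Per-tensor symmetric quantization (illustrative scheme, an assumption).

    Maps each float to a signed integer in [-qmax, qmax], where
    qmax = 2**(bits - 1) - 1 (7 for 4-bit), using one shared scale.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp into range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floats."""
    return [v * scale for v in q]


def fake_quantize(weights, bits=4):
    """Quantize-then-dequantize round trip.

    PTQ applies this once after training; QAT applies it in the forward
    pass during training so the weights learn to tolerate the error.
    """
    q, scale = quantize_symmetric(weights, bits)
    return dequantize(q, scale)


weights = [0.7, -0.35, 0.1, 0.02, -0.69]
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, fake_quantize(weights, bits)))
    print(f"{bits}-bit: levels={q}, max round-trip error={err:.4f}")
```

With only 15 integer levels at 4 bits, small weights such as 0.02 collapse to zero; that lost precision is what QAT trains the model to absorb.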

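Why this matters: QAT simulates 4-bit rounding during training, so the weights adapt to it and lose far less quality when compressed. At 4 bits each weight takes half a byte, so the 27B model needs roughly 14 GB for its weights alone, which calls for a 16 GB or larger card; it is the 12B model that lands near ~7 GB, which is RTX 3060 (12 GB) territory. This brings capable Gemma 3 models within reach of mainstream consumer GPUs.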
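The VRAM figures follow from parameters times bits per weight. A rough sketch for the four Gemma 3 sizes: it counts weights only, and the fixed `headroom_gb` for KV cache and runtime buffers is an illustrative assumption, not a measured value. Real int4 checkpoints are slightly larger because they also store quantization scales.

```python
# Gemma 3 parameter counts, in billions.
GEMMA3_SIZES_B = {"1B": 1, "4B": 4, "12B": 12, "27B": 27}


def weight_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


def fits(params_b: float, bits_per_weight: float, vram_gb: float,
         headroom_gb: float = 1.5) -> bool:
    """Rough fit check: weights plus a fixed headroom (assumed, not measured)."""
    return weight_memory_gb(params_b, bits_per_weight) + headroom_gb <= vram_gb


for name, p in GEMMA3_SIZES_B.items():
    bf16 = weight_memory_gb(p, 16)
    int4 = weight_memory_gb(p, 4)
    print(f"Gemma 3 {name}: {bf16:5.1f} GB bf16 -> {int4:5.1f} GB int4, "
          f"fits a 12 GB RTX 3060: {fits(p, 4, 12)}")
```

By this estimate the 12B model (6 GB of int4 weights) fits a 12 GB card with room to spare, while the 27B model (13.5 GB) does not.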

⚙️ Inference Engine Updates

This cycle's biggest story: inference engines are shipping new releases at an extraordinary pace (llama.cpp alone published 53 builds).

🤔 Worth Watching

  1. Qwen3.6-35B-A3B MoE: 35B total parameters with only 3B active per token, excellent for local GPU deployment
  2. Gemma-3n-E4B-it: new ultra-efficient model with a roughly 4B effective parameter footprint, a strong candidate for GPUs with under 6 GB of VRAM
  3. llama.cpp velocity: 53 builds this cycle may signal new architecture support or quantization work
  4. FLUX.1-dev approaching #1: only 300 likes behind DeepSeek-R1, so an image-generation model may soon overtake a reasoning model on Hugging Face trending