AI Model Roundup — Qwen 3.6, SGLang 0.5, and RTX 3090 Inference Benchmarks
Qwen 3.6 Family Update
Qwen released the 3.6 model family this week, including a 27B-parameter model that, at Q4_K_M quantization, runs comfortably on a single RTX 3090. Key highlights:
- Qwen 3.6 27B Q4_K_M — Strong reasoning capabilities at 17GB VRAM usage. Outperforms Llama 3.1 8B on most benchmarks, though it has roughly 3.4x the parameters (27B vs. 8B).
- Context window — 32K tokens with good attention quality at long context lengths.
- GGUF support — Full llama.cpp compatibility with all quantization levels from Q2 to Q8.
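As a rough cross-check on the 17GB figure, the weight footprint of a quantized model can be estimated from parameter count and average bits per weight. The sketch below is only an approximation: the K-quant bits-per-weight values are commonly quoted averages and are assumptions here, not numbers from the Qwen release, and the helper name `weight_size_gb` is hypothetical.

```python
# Approximate average bits per weight for a few llama.cpp quantization types.
# Q8_0 is 8.5 by construction (32 int8 weights plus one fp16 scale per block).
# The K-quant values are rough averages that vary by model (assumption).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
}


def weight_size_gb(n_params: float, quant: str) -> float:
    """Estimate the quantized weight size in GB (1 GB = 1e9 bytes)."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_size_gb(27e9, quant):.1f} GB of weights")
```

For 27B parameters this gives roughly 16.4GB at Q4_K_M, which is in line with the reported 17GB once KV cache and runtime overhead are added on top of the weights.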
Benchmarks (RTX 3090, single GPU)
| Model | Tokens/sec | VRAM | Quality Tier |
|-------|-----------|------|-------------|
| Qwen 3.6 27B Q4 | ~18 tok/s | 17GB | High |
| Qwen 3.6 27B Q5 | ~14 tok/s | 21GB | Higher |
| Qwen 3.6 27B Q8 | ~8 tok/s | 29GB | Max |
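One way to read the table is to multiply tokens per second by memory footprint, which gives the effective rate at which the model is streamed per second of decoding. The sketch below assumes a dense model where each generated token reads the full footprint once, and it treats the VRAM column as that footprint. Both are simplifying assumptions, and the helper names are hypothetical.

```python
# Rows copied from the benchmark table: (label, tokens/sec, VRAM in GB).
ROWS = [
    ("Qwen 3.6 27B Q4", 18.0, 17.0),
    ("Qwen 3.6 27B Q5", 14.0, 21.0),
    ("Qwen 3.6 27B Q8", 8.0, 29.0),
]

RTX_3090_VRAM_GB = 24.0


def effective_bandwidth_gbs(tok_per_s: float, footprint_gb: float) -> float:
    """GB read per second if every token touches the full footprint once."""
    return tok_per_s * footprint_gb


def fits_on_card(footprint_gb: float, card_gb: float = RTX_3090_VRAM_GB) -> bool:
    """True when the reported footprint fits in a single card's VRAM."""
    return footprint_gb <= card_gb


if __name__ == "__main__":
    for label, tok_s, vram in ROWS:
        bw = effective_bandwidth_gbs(tok_s, vram)
        note = "fits in 24GB" if fits_on_card(vram) else "exceeds 24GB"
        print(f"{label}: ~{bw:.0f} GB/s effective, {note}")
```

Under these assumptions the Q4 and Q5 rows both land near 300 GB/s, which is what a memory-bandwidth-bound workload would look like. The Q8 row comes out lower, and its 29GB footprint exceeds a single 3090's 24GB, so that row likely did not run entirely from one card's VRAM.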
SGLang v0.5 Release
SGLang reached v0.5 with significant performance improvements:
- RadixAttention — Improved KV cache sharing across requests, reducing memory overhead by up to 40%
- Continuous batching v2 — Better throughput for high-concurrency workloads
- FlashInfer integration — Hardware-specific kernel optimization for NVIDIA GPUs
On RTX 3090 hardware, throughput for multi-request workloads improves by 23% over v0.4.
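The idea behind KV cache sharing can be illustrated with a toy prefix cache: requests that start with the same tokens (for example, a shared system prompt) only need new cache entries for the part that differs. The sketch below is a conceptual illustration using a plain token-level trie. It is not SGLang's implementation, and the class name `PrefixCache` is hypothetical.

```python
from __future__ import annotations


class PrefixCache:
    """Toy token-level trie standing in for a radix tree over KV cache entries."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> int:
        """Store a request's tokens and return how many leading tokens were already cached."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node:
                # This position was cached by an earlier request sharing the prefix.
                reused += 1
            else:
                # First divergence: every node below this point is new.
                node[tok] = {}
            node = node[tok]
        return reused


if __name__ == "__main__":
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    cache = PrefixCache()
    print(cache.insert(system_prompt + [100, 101]))       # first request, nothing cached yet
    print(cache.insert(system_prompt + [200, 201, 202]))  # shares the 8-token system prompt
```

In this toy run the second request reuses the eight system-prompt tokens and only its three unique tokens need new entries. A production cache would also handle eviction and store tensors rather than empty dicts.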
RTX 3090 vs RTX 3080 Inference Comparison
Benchmarks running Qwen 3.6 27B Q4_K_M on single- and dual-GPU setups:
- RTX 3090 (24GB): 18.2 tok/s single GPU, 29.5 tok/s dual GPU (tensor parallel)
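- RTX 3080 (10GB): 12.1 tok/s single GPU. Requires 4-bit quantization, and because the 17GB Q4_K_M model exceeds 10GB of VRAM, part of it presumably runs from system RAM.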
- Mixed 3090 + 3080: Works via tensor parallelism but is bottlenecked by the 3080
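The dual-3090 result can be expressed as a speedup and a parallel efficiency relative to the single-GPU run. The short sketch below uses only the two 3090 figures listed above; the function name `scaling` is hypothetical.

```python
from __future__ import annotations


def scaling(single_tok_s: float, multi_tok_s: float, n_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency) of a multi-GPU run versus a single GPU."""
    speedup = multi_tok_s / single_tok_s
    efficiency = speedup / n_gpus
    return speedup, efficiency


# Figures from the RTX 3090 bullet: 18.2 tok/s single, 29.5 tok/s dual.
speedup, efficiency = scaling(18.2, 29.5, 2)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints: speedup 1.62x, efficiency 81%
```

An efficiency of about 81% means the second card adds well under a full GPU's worth of throughput, which is typical when each decoding step includes inter-GPU communication.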
Recommendation: For dual GPU inference, matching GPUs is essential. Mixed configurations waste the faster card's bandwidth waiting for the slower one.
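The bottleneck argument can be made concrete with an idealized model: if tensor parallelism splits each step evenly and every step waits for the slowest card, throughput is set by that card alone. The sketch below ignores communication cost and memory limits, and it uses each card's single-GPU rate as a stand-in for its relative speed. Both are assumptions, and `tensor_parallel_tok_s` is a hypothetical helper.

```python
from __future__ import annotations


def tensor_parallel_tok_s(gpu_rates: list[float]) -> float:
    """Idealized tokens/sec with an even split where each step waits for the slowest GPU.

    gpu_rates holds each card's single-GPU tokens/sec on the whole model.
    """
    n = len(gpu_rates)
    # Each GPU handles 1/n of the per-token work, taking (1/n) / rate seconds.
    step_time = max((1.0 / n) / rate for rate in gpu_rates)
    return 1.0 / step_time


if __name__ == "__main__":
    matched = tensor_parallel_tok_s([18.2, 18.2])  # two RTX 3090s
    mixed = tensor_parallel_tok_s([18.2, 12.1])    # RTX 3090 + RTX 3080
    print(f"matched: {matched:.1f} tok/s (ideal), mixed: {mixed:.1f} tok/s (ideal)")
```

In this model the matched pair tops out at 36.4 tok/s and the mixed pair at 24.2 tok/s, with the 3090 busy only about two thirds of the time (12.1 / 18.2). The measured 29.5 tok/s for dual 3090s sits below the ideal figure, so real-world overhead comes on top of the mismatch penalty.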
Notable Mentions
- llama.cpp v3.5 — Added support for Qwen 3.6 GGUF models with improved flash attention kernels
- vLLM 0.8 — Memory-efficient batching now supports 128K context lengths on 24GB GPUs
- Hugging Face — Over 50 new fine-tunes of Qwen 3.6 in the past week, mostly focused on coding and reasoning
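To put long-context claims like the 128K figure in perspective, KV cache size grows linearly with context length and can be estimated from a model's attention shape. The sketch below uses Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) purely as an example, since the item above does not name a model. An fp16 cache without paging or cache quantization is also an assumption.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # fp16 cache (assumption)
) -> int:
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Factor of 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Example shape: Llama 3.1 8B (chosen for illustration only).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (32 * 1024, 128 * 1024):
        gib = kv_cache_bytes(ctx, **LLAMA_3_1_8B) / 2**30
        print(f"{ctx:>7} tokens: {gib:.0f} GiB of fp16 KV cache")
```

For this example shape the cache needs 4 GiB at 32K tokens and 16 GiB at 128K tokens, so fitting a full 128K context on a 24GB card leaves limited room for the weights unless they are quantized.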
Data sourced from the Hugging Face model hub, SGLang GitHub releases, and local benchmarking on RTX 3090/3080 hardware. All benchmarks were run with SGLang v0.5 and llama.cpp v3.5.