How I Built an AI Model Tracker Using Only Local Inference
How I Built an AI Model Tracker Using Only Local Inference
I just launched Model Intelligence — an automated blog that tracks AI model releases, benchmarks, and pricing. The interesting part: it's entirely built and maintained using local inference on homestead GPUs.
No OpenAI API calls. No cloud LLM costs. No monthly subscriptions. Just two consumer GPUs running in a Docker container on Proxmox.
Here's exactly how I did it.
The Hardware: Homestead GPU Lab
Location: Sunbreak Forest Farm, Issaquah WA (yes, the goat barn is nearby)
GPUs:
- RTX 3090 (24GB) — "Thinking Node" for heavy models
- RTX 3080 (10GB) — "Fast Node" for quick responses
- Multiple RTX 3060s (12GB each) — swarm for parallel tasks
Cooling: Custom liquid cooling loop with Docker containers running nvidia-container-toolkit for GPU passthrough. The GPUs are in a rack-mounted setup with custom water blocks. Yes, I built the cooling myself.
Infrastructure:
Proxmox Host
├── Coolify Main VM (services, CI/CD)
│ └── Hermes Agent → runs the blog
└── Coolify GPU VM (inference)
├── RTX 3090 (SGLang + llama.cpp)
└── RTX 3080 (SGLang + llama.cpp)
The Software Stack
Hermes Agent (The Brain)
Hermes Agent is an open-source AI agent framework that runs locally. I configured it to:
- Run on local inference — uses Qwen3.6-27B model via SGLang
- Schedule cron jobs — automated daily scans
- Access tools — terminal, browser, file system, web search
- Communicate via Telegram — I manage everything from my phone
Key config:
model: Qwen_Qwen3.6-27B-Q4_K_M.gguf
provider: custom (llama.cpp)
tools: [terminal, file, web, browser, search]
SGLang (The Engine)
SGLang is the fastest inference engine for my use case. Why?
- Speculative decoding (MTP) — 2-3x speedup using model's own draft heads
- RadixAttention — instant prefix caching for repeated prompts
- AWQ quantization — 45 tokens/sec on RTX 3090 with Qwen3.6-27B-AWQ
Performance comparison on RTX 3090:
| Engine | Model | Tokens/sec | First Token | Context | |--------|-------|-----------|-------------|---------| | SGLang + AWQ | Qwen3.6-27B | 45 TPS | 196ms | 8K | | llama.cpp + Q4 | Qwen3.6-27B | 38 TPS | 633ms | 40K | | SGLang + MoE | Gemma-4-MoE | 26 TPS | 2451ms | 32K |
Source: Coolify GPU Models README
The Docker Setup
Each model runs in its own container with GPU passthrough:
# docker-compose.yml (SGLang on RTX 3090)
services:
sglang:
image: lmsysorg/sglang:latest
environment:
- CUDA_VISIBLE_DEVICES=0 # RTX 3090
volumes:
- hf_cache:/root/.cache # Model persistence
command: >
python3 -m sglang.launch_server
--model Qwen/Qwen3.6-27B-AWQ
--port 11436
--mem-fraction-static 0.85
Full compose files: coolify-gpu-models repo
The Blog Build System
Static site generated via Node.js:
// Simplified build process
gray-matter (YAML frontmatter) → remark (Markdown → HTML)
→ JSON-LD schema injection
→ Cloudflare Pages deployment
The build script handles:
- Markdown parsing with remark
- JSON-LD structured data (SEO)
- Sitemap.xml generation
- RSS feed
- Cloudflare cache headers
How the Automation Works
Daily Cron Job (Every 12 Hours)
The job runs this sequence:
- Scan HuggingFace — new models with 1000+ likes
- Check GitHub releases — SGLang, vLLM, llama.cpp, Ollama
- Fetch benchmarks — HF Open LLM Leaderboard, LMSYS Arena
- Cross-reference data — multiple sources per model
- Generate markdown — with source links everywhere
- Build the site — Node.js static generation
- Deploy — push to Cloudflare Pages
Content Generation
The AI agent writes content using this structure:
# Model Intelligence — 2026-05-28
## New Model Releases
**Qwen3.6-27B** — 1,510 likes on [HuggingFace](https://huggingface.co/Qwen/Qwen3.6-27B)
- 27B params, 17GB VRAM at Q4
- [Benchmark scores](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- Local inference: 45 TPS on RTX 3090 (SGLang + AWQ)
## Inference Engine Updates
**SGLang v0.5.12** — [Release Notes](https://github.com/sgl-project/sglang/releases)
- Parallelism: TP/EP/CP/Data Parallel Attention
- Hardware: Nvidia B300/B200/H200, AMD MI35X
## Pricing Comparison
| Model | Local (RTX 3090) | Cloud API | Source |
|-------|-----------------|-----------|--------|
| Qwen3.6-27B | Free (homestead) | $0.80/M tokens | [DeepSeek](https://platform.deepseek.com/) |
Every claim has a source link. Every benchmark is cross-referenced. Every price has a provider link.
Why Local Inference Matters
Cost Comparison
| Approach | Monthly Cost | Speed | Privacy | |----------|-------------|-------|---------| | Local (RTX 3090) | $0 (after hardware) | 45 TPS | 100% | | OpenAI API | $20-50/mo | 100+ TPS | 0% | | Cloud GPU (RunPod) | $50-100/mo | 45 TPS | 50% |
After the initial hardware investment (~$2000 for both GPUs), the marginal cost of inference is electricity. At my homestead rate: ~$5/month for 24/7 operation.
Privacy & Sovereignty
- No data leaves the homestead — all processing happens locally
- No API rate limits — unlimited generation
- No vendor lock-in — open-source models, open-source tools
- Full audit trail — git commits, cron logs, benchmark data
Performance
SGLang + AWQ on RTX 3090:
- 45 tokens/sec — fast enough for real-time conversation
- 196ms first token — responsive for interactive use
- 8K context — sufficient for most tasks
For comparison, this is comparable to paid API tiers while costing nothing per request.
Lessons Learned
What Worked
- SGLang is the fastest engine for my use case (2-3x llama.cpp on large models)
- AWQ quantization gives the best speed/quality tradeoff
- Docker containers make GPU management trivial
- Hermes Agent cron jobs automate everything
- Static site generation is fast and reliable
What Didn't
- Signal integration — still fighting SSL issues (see: Signal truststore problems)
- First boot is slow — SGLang takes 15-20 minutes for model download + JIT compilation
- VRAM limits — MoE models are tricky (Gemma-4-MoE needs more than expected)
Pro Tips
- Pre-download models to avoid slow first boots
- Use named volumes for model persistence across restarts
- Set health checks with long start periods (1800s)
- Monitor VRAM with
nvidia-smibefore deploying new models - Benchmark everything — paper numbers ≠ real numbers
The Result
A fully automated AI model intelligence blog that:
- Publishes daily without manual intervention
- Cross-references benchmarks from multiple sources
- Includes pricing and cost estimates
- Has source links for every claim
- Runs on homestead hardware
- Costs nothing but electricity to operate
Try it: ai-updates.pages.dev
Code: github.com/jerfletcher/ai-updates (coming soon — currently private)
GPU setup: coolify-gpu-models repo
Built on homestead GPUs with local inference. No cloud APIs were harmed.