How I Built an AI Model Tracker Using Only Local Inference

2026-05-29 ·Jeremy Fletcher 5 min read

How I Built an AI Model Tracker Using Only Local Inference

I just launched Model Intelligence — an automated blog that tracks AI model releases, benchmarks, and pricing. The interesting part: it's entirely built and maintained using local inference on homestead GPUs.

No OpenAI API calls. No cloud LLM costs. No monthly subscriptions. Just two consumer GPUs running in a Docker container on Proxmox.

Here's exactly how I did it.

The Hardware: Homestead GPU Lab

Location: Sunbreak Forest Farm, Issaquah WA (yes, the goat barn is nearby)

GPUs:

RTX 3090 (24GB) — "Thinking Node" for heavy models
RTX 3080 (10GB) — "Fast Node" for quick responses
Multiple RTX 3060s (12GB each) — swarm for parallel tasks

Cooling: Custom liquid cooling loop with Docker containers running nvidia-container-toolkit for GPU passthrough. The GPUs are in a rack-mounted setup with custom water blocks. Yes, I built the cooling myself.

Infrastructure:

Proxmox Host
├── Coolify Main VM (services, CI/CD)
│   └── Hermes Agent → runs the blog
└── Coolify GPU VM (inference)
    ├── RTX 3090 (SGLang + llama.cpp)
    └── RTX 3080 (SGLang + llama.cpp)

The Software Stack

Hermes Agent (The Brain)

Hermes Agent is an open-source AI agent framework that runs locally. I configured it to:

Run on local inference — uses Qwen3.6-27B model via SGLang
Schedule cron jobs — automated daily scans
Access tools — terminal, browser, file system, web search
Communicate via Telegram — I manage everything from my phone

Key config:

model: Qwen_Qwen3.6-27B-Q4_K_M.gguf
provider: custom (llama.cpp)
tools: [terminal, file, web, browser, search]

SGLang (The Engine)

SGLang is the fastest inference engine for my use case. Why?

Speculative decoding (MTP) — 2-3x speedup using model's own draft heads
RadixAttention — instant prefix caching for repeated prompts
AWQ quantization — 45 tokens/sec on RTX 3090 with Qwen3.6-27B-AWQ

Performance comparison on RTX 3090:

Engine	Model	Tokens/sec	First Token	Context
SGLang + AWQ	Qwen3.6-27B	45 TPS	196ms	8K
llama.cpp + Q4	Qwen3.6-27B	38 TPS	633ms	40K
SGLang + MoE	Gemma-4-MoE	26 TPS	2451ms	32K

Source: Coolify GPU Models README

The Docker Setup

Each model runs in its own container with GPU passthrough:

# docker-compose.yml (SGLang on RTX 3090)
services:
  sglang:
    image: lmsysorg/sglang:latest
    environment:
      - CUDA_VISIBLE_DEVICES=0  # RTX 3090
    volumes:
      - hf_cache:/root/.cache   # Model persistence
    command: >
      python3 -m sglang.launch_server
      --model Qwen/Qwen3.6-27B-AWQ
      --port 11436
      --mem-fraction-static 0.85

Full compose files: coolify-gpu-models repo

The Blog Build System

Static site generated via Node.js:

// Simplified build process
gray-matter (YAML frontmatter) → remark (Markdown → HTML)
  → JSON-LD schema injection
  → Cloudflare Pages deployment

The build script handles:

How the Automation Works

Daily Cron Job (Every 12 Hours)

The job runs this sequence:

Scan HuggingFace — new models with 1000+ likes
Check GitHub releases — SGLang, vLLM, llama.cpp, Ollama
Fetch benchmarks — HF Open LLM Leaderboard, LMSYS Arena
Cross-reference data — multiple sources per model
Generate markdown — with source links everywhere
Build the site — Node.js static generation
Deploy — push to Cloudflare Pages

Content Generation

The AI agent writes content using this structure:

# Model Intelligence — 2026-05-28

## New Model Releases

**Qwen3.6-27B** — 1,510 likes on [HuggingFace](https://huggingface.co/Qwen/Qwen3.6-27B)
- 27B params, 17GB VRAM at Q4
- [Benchmark scores](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- Local inference: 45 TPS on RTX 3090 (SGLang + AWQ)

## Inference Engine Updates

**SGLang v0.5.12** — [Release Notes](https://github.com/sgl-project/sglang/releases)
- Parallelism: TP/EP/CP/Data Parallel Attention
- Hardware: Nvidia B300/B200/H200, AMD MI35X

## Pricing Comparison

| Model | Local (RTX 3090) | Cloud API | Source |
|-------|-----------------|-----------|--------|
| Qwen3.6-27B | Free (homestead) | $0.80/M tokens | [DeepSeek](https://platform.deepseek.com/) |

Every claim has a source link. Every benchmark is cross-referenced. Every price has a provider link.

Why Local Inference Matters

Cost Comparison

Approach	Monthly Cost	Speed	Privacy
Local (RTX 3090)	$0 (after hardware)	45 TPS	100%
OpenAI API	$20-50/mo	100+ TPS	0%
Cloud GPU (RunPod)	$50-100/mo	45 TPS	50%

After the initial hardware investment ($2000 for both GPUs), the marginal cost of inference is electricity. At my homestead rate: **$5/month for 24/7 operation**.

Privacy & Sovereignty

No data leaves the homestead — all processing happens locally
No API rate limits — unlimited generation
No vendor lock-in — open-source models, open-source tools
Full audit trail — git commits, cron logs, benchmark data

Performance

SGLang + AWQ on RTX 3090:

45 tokens/sec — fast enough for real-time conversation
196ms first token — responsive for interactive use
8K context — sufficient for most tasks

For comparison, this is comparable to paid API tiers while costing nothing per request.

Lessons Learned

What Worked

SGLang is the fastest engine for my use case (2-3x llama.cpp on large models)
AWQ quantization gives the best speed/quality tradeoff
Docker containers make GPU management trivial
Hermes Agent cron jobs automate everything
Static site generation is fast and reliable

What Didn't

Signal integration — still fighting SSL issues (see: Signal truststore problems)
First boot is slow — SGLang takes 15-20 minutes for model download + JIT compilation
VRAM limits — MoE models are tricky (Gemma-4-MoE needs more than expected)

Pro Tips

Pre-download models to avoid slow first boots
Use named volumes for model persistence across restarts
Set health checks with long start periods (1800s)
Monitor VRAM with nvidia-smi before deploying new models
Benchmark everything — paper numbers ≠ real numbers

The Result

A fully automated AI model intelligence blog that:

Publishes daily without manual intervention
Cross-references benchmarks from multiple sources
Includes pricing and cost estimates
Has source links for every claim
Runs on homestead hardware
Costs nothing but electricity to operate

Try it: ai-updates.pages.dev

Code: github.com/jerfletcher/ai-updates (coming soon — currently private)

GPU setup: coolify-gpu-models repo

Built on homestead GPUs with local inference. No cloud APIs were harmed.

local-inferencehermes-agentgpu-homelabsglangautomation