How I Built an AI Model Tracker Using Only Local Inference

I just launched Model Intelligence — an automated blog that tracks AI model releases, benchmarks, and pricing. The interesting part: it's entirely built and maintained using local inference on homestead GPUs.

No OpenAI API calls. No cloud LLM costs. No monthly subscriptions. Just two consumer GPUs running in a Docker container on Proxmox.

Here's exactly how I did it.

The Hardware: Homestead GPU Lab

Location: Sunbreak Forest Farm, Issaquah WA (yes, the goat barn is nearby)

GPUs: RTX 3090 and RTX 3080

Cooling: Custom liquid cooling loop. The GPUs sit in a rack-mounted setup with custom water blocks, and yes, I built the cooling myself. GPU passthrough into the Docker containers is handled separately by nvidia-container-toolkit.

Infrastructure:

Proxmox Host
├── Coolify Main VM (services, CI/CD)
│   └── Hermes Agent → runs the blog
└── Coolify GPU VM (inference)
    ├── RTX 3090 (SGLang + llama.cpp)
    └── RTX 3080 (SGLang + llama.cpp)

The Software Stack

Hermes Agent (The Brain)

Hermes Agent is an open-source AI agent framework that runs locally. I configured it to:

  1. Run on local inference — uses Qwen3.6-27B model via SGLang
  2. Schedule cron jobs — automated daily scans
  3. Access tools — terminal, browser, file system, web search
  4. Communicate via Telegram — I manage everything from my phone

Key config:

model: Qwen_Qwen3.6-27B-Q4_K_M.gguf
provider: custom (llama.cpp)
tools: [terminal, file, web, browser, search]

SGLang (The Engine)

SGLang is the fastest inference engine for my use case. Why?

Performance comparison on RTX 3090:

| Engine | Model | Tokens/sec | First Token | Context |
|--------|-------|-----------|-------------|---------|
| SGLang + AWQ | Qwen3.6-27B | 45 TPS | 196ms | 8K |
| llama.cpp + Q4 | Qwen3.6-27B | 38 TPS | 633ms | 40K |
| SGLang + MoE | Gemma-4-MoE | 26 TPS | 2451ms | 32K |

Source: Coolify GPU Models README
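To make the table easier to read, here is a small sketch of how the two headline metrics can be derived from raw timestamps, plus the ratios implied by the SGLang and llama.cpp rows. The function and field names are illustrative assumptions, not code from the project.

```javascript
// Compute time-to-first-token and decode throughput from raw timestamps.
// `tokens` is the number of tokens produced after the first one arrived.
// All names here are hypothetical, chosen for illustration only.
function summarizeRun({ startMs, firstTokenMs, endMs, tokens }) {
  const ttftMs = firstTokenMs - startMs;
  const decodeSeconds = (endMs - firstTokenMs) / 1000;
  // Guard against a zero-length decode window.
  const tokensPerSec = decodeSeconds > 0 ? tokens / decodeSeconds : 0;
  return { ttftMs, tokensPerSec };
}

// Ratio helper for comparing two engines on the same metric.
function ratio(a, b) {
  return a / b;
}

// Numbers taken from the table above (RTX 3090, Qwen3.6-27B).
const throughputGain = ratio(45, 38);   // SGLang TPS / llama.cpp TPS
const firstTokenGain = ratio(633, 196); // llama.cpp first token / SGLang first token
console.log(throughputGain.toFixed(2), firstTokenGain.toFixed(2)); // 1.18 3.23
```

On these rows, the larger gap is in first-token latency rather than steady-state throughput.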

The Docker Setup

Each model runs in its own container with GPU passthrough:

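# docker-compose.yml (SGLang on RTX 3090)
# Excerpt: assumes the Docker host already exposes GPUs to containers
# via nvidia-container-toolkit.
services:
  sglang:
    image: lmsysorg/sglang:latest
    environment:
      - CUDA_VISIBLE_DEVICES=0  # RTX 3090
    volumes:
      - hf_cache:/root/.cache   # Model persistence
    command: >
      python3 -m sglang.launch_server
      --model Qwen/Qwen3.6-27B-AWQ
      --port 11436
      --mem-fraction-static 0.85

# Named volumes must be declared at the top level for Compose to create them.
volumes:
  hf_cache: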

Full compose files: coolify-gpu-models repo
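Once the container is up, SGLang serves an OpenAI-compatible HTTP API, so any client can talk to it. A minimal sketch of a Node.js caller follows; the port comes from the compose file, while the `localhost` host, the helper names, and the `max_tokens` default are assumptions for illustration.

```javascript
// Assumed base URL: the port is from the compose file, the host is a guess.
const BASE_URL = "http://localhost:11436";

// Build an OpenAI-style chat completion request body.
function buildChatRequest(prompt, maxTokens = 256) {
  return {
    model: "Qwen/Qwen3.6-27B-AWQ",
    messages: [{ role: "user", content: prompt }],
    max_tokens: maxTokens,
  };
}

// Send the request and return the assistant text.
// Requires Node 18+ for the built-in fetch.
async function chat(prompt) {
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  if (!res.ok) throw new Error(`SGLang returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Usage would look like `chat("Summarize today's releases").then(console.log)`, run from a machine that can reach the GPU VM.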

The Blog Build System

Static site generated via Node.js:

// Simplified build process
gray-matter (YAML frontmatter) → remark (Markdown → HTML)
  → JSON-LD schema injection
  → Cloudflare Pages deployment
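The first and third stages of that pipeline can be sketched without any dependencies. This is not the project's build script: it swaps gray-matter for a hand-rolled splitter that only understands flat `key: value` lines, and the `BlogPosting` schema fields are an assumption about what gets injected.

```javascript
// Split a Markdown file into frontmatter fields and body.
// Only handles flat `key: value` lines; gray-matter parses full YAML.
function splitFrontmatter(source) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) return { data: {}, content: source };
  const data = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines that are not key/value pairs
    data[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { data, content: match[2] };
}

// Build a schema.org BlogPosting JSON-LD script tag from frontmatter.
function jsonLdTag(data) {
  const schema = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: data.title,
    datePublished: data.date,
  };
  // Escape "<" so the payload cannot close the script tag early.
  const json = JSON.stringify(schema).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}

const post = splitFrontmatter(
  "---\ntitle: Model Intelligence\ndate: 2026-05-28\n---\n# Hello\n"
);
console.log(post.data.title, "|", post.content.trim());
console.log(jsonLdTag(post.data));
```

In a real build, the body would go through remark for Markdown to HTML conversion, and the tag would be placed in the page `<head>`.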

The build script handles:

How the Automation Works

Twice-Daily Cron Job (Every 12 Hours)

The job runs this sequence:

  1. Scan HuggingFace — new models with 1000+ likes
  2. Check GitHub releases — SGLang, vLLM, llama.cpp, Ollama
  3. Fetch benchmarks — HF Open LLM Leaderboard, LMSYS Arena
  4. Cross-reference data — multiple sources per model
  5. Generate markdown — with source links everywhere
  6. Build the site — Node.js static generation
  7. Deploy — push to Cloudflare Pages
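Step 1 of the sequence above can be sketched as a filter over the Hugging Face Hub model listing. The cron expression is standard five-field syntax for a 12-hour interval; how Hermes Agent actually stores its schedule is not shown here, and the `seenIds` set used to decide what counts as "new" is a hypothetical detail.

```javascript
// Standard five-field cron expression: minute 0 of every 12th hour.
const SCHEDULE = "0 */12 * * *";

// Public Hugging Face Hub listing endpoint, sorted by likes descending.
const HF_URL =
  "https://huggingface.co/api/models?sort=likes&direction=-1&limit=100";

// Keep models at or above the like threshold that were not reported before.
function pickNewPopular(models, seenIds, minLikes = 1000) {
  return models.filter((m) => m.likes >= minLikes && !seenIds.has(m.id));
}

// Fetch and filter. Requires Node 18+ for the built-in fetch.
async function scanHuggingFace(seenIds) {
  const res = await fetch(HF_URL);
  if (!res.ok) throw new Error(`Hub returned HTTP ${res.status}`);
  return pickNewPopular(await res.json(), seenIds);
}

// Offline example with made-up records shaped like the Hub response.
const sample = [
  { id: "org/model-a", likes: 1510 },
  { id: "org/model-b", likes: 42 },
  { id: "org/model-c", likes: 2300 },
];
const fresh = pickNewPopular(sample, new Set(["org/model-c"]));
console.log(fresh.map((m) => m.id)); // [ 'org/model-a' ]
```

Keeping the filter separate from the fetch makes the selection rule testable without network access.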

Content Generation

The AI agent writes content using this structure:

# Model Intelligence — 2026-05-28

## New Model Releases

**Qwen3.6-27B** — 1,510 likes on [HuggingFace](https://huggingface.co/Qwen/Qwen3.6-27B)
- 27B params, 17GB VRAM at Q4
- [Benchmark scores](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- Local inference: 45 TPS on RTX 3090 (SGLang + AWQ)

## Inference Engine Updates

**SGLang v0.5.12** — [Release Notes](https://github.com/sgl-project/sglang/releases)
- Parallelism: TP/EP/CP/Data Parallel Attention
- Hardware: Nvidia B300/B200/H200, AMD MI35X

## Pricing Comparison

| Model | Local (RTX 3090) | Cloud API | Source |
|-------|-----------------|-----------|--------|
| Qwen3.6-27B | Free (homestead) | $0.80/M tokens | [DeepSeek](https://platform.deepseek.com/) |

Every claim has a source link. Every benchmark is cross-referenced. Every price has a provider link.
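One way to enforce the "every price has a provider link" rule is to make the renderer refuse rows without a source. The sketch below is an assumption about how that could be done, with hypothetical field names, not the agent's actual generation code.

```javascript
// Render one pricing-table row, refusing rows that lack a source link.
function pricingRow({ model, local, cloud, sourceName, sourceUrl }) {
  if (!sourceUrl) throw new Error(`Missing source link for ${model}`);
  return `| ${model} | ${local} | ${cloud} | [${sourceName}](${sourceUrl}) |`;
}

// Render the full table with the same header used in the template above.
function pricingTable(rows) {
  const header = [
    "| Model | Local (RTX 3090) | Cloud API | Source |",
    "|-------|-----------------|-----------|--------|",
  ];
  return header.concat(rows.map((row) => pricingRow(row))).join("\n");
}

const table = pricingTable([
  {
    model: "Qwen3.6-27B",
    local: "Free (homestead)",
    cloud: "$0.80/M tokens",
    sourceName: "DeepSeek",
    sourceUrl: "https://platform.deepseek.com/",
  },
]);
console.log(table);
```

Failing loudly at render time keeps an unsourced number from ever reaching the published page.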

Why Local Inference Matters

Cost Comparison

| Approach | Monthly Cost | Speed | Privacy |
|----------|-------------|-------|---------|
| Local (RTX 3090) | $0 (after hardware) | 45 TPS | 100% |
| OpenAI API | $20-50/mo | 100+ TPS | 0% |
| Cloud GPU (RunPod) | $50-100/mo | 45 TPS | 50% |

After the initial hardware investment (~$2000 for both GPUs), the marginal cost of inference is electricity. At my homestead rate: ~$5/month for 24/7 operation.
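Those figures imply a break-even point against renting a cloud GPU. A quick worked calculation, using only the numbers quoted in this post (about $2000 of hardware, about $5/month of electricity, and the $50-100/month cloud GPU range):

```javascript
// Months until the hardware cost is recovered versus a monthly cloud bill.
function breakEvenMonths(hardwareCost, cloudMonthly, electricityMonthly) {
  const monthlySavings = cloudMonthly - electricityMonthly;
  if (monthlySavings <= 0) return Infinity; // local never pays off
  return hardwareCost / monthlySavings;
}

// Low and high ends of the cloud GPU price range from the table.
console.log(breakEvenMonths(2000, 50, 5).toFixed(1));  // 44.4
console.log(breakEvenMonths(2000, 100, 5).toFixed(1)); // 21.1
```

So on these numbers the hardware pays for itself in roughly two to four years, depending on which cloud price it displaces.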

Privacy & Sovereignty

Performance

SGLang + AWQ on RTX 3090:

This is comparable to paid API tiers while costing nothing per request.

Lessons Learned

What Worked

  1. SGLang is the fastest engine for my use case (45 vs 38 TPS and roughly 3x faster first token than llama.cpp on Qwen3.6-27B)
  2. AWQ quantization gives the best speed/quality tradeoff
  3. Docker containers make GPU management trivial
  4. Hermes Agent cron jobs automate everything
  5. Static site generation is fast and reliable

What Didn't

  1. Signal integration — still fighting SSL issues (see: Signal truststore problems)
  2. First boot is slow — SGLang takes 15-20 minutes for model download + JIT compilation
  3. VRAM limits — MoE models are tricky (Gemma-4-MoE needs more than expected)

Pro Tips

  1. Pre-download models to avoid slow first boots
  2. Use named volumes for model persistence across restarts
  3. Set health checks with long start periods (1800s)
  4. Monitor VRAM with nvidia-smi before deploying new models
  5. Benchmark everything — paper numbers ≠ real numbers
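Tip 4 can be automated with a small parser over `nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits`. The sketch below is an illustration with hypothetical helper names; the sample readings are made up, and the 17 GB requirement is the Q4 figure quoted earlier for Qwen3.6-27B.

```javascript
// Parse CSV output from:
//   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
//              --format=csv,noheader,nounits
// Memory values are reported in MiB when `nounits` is set.
function parseNvidiaSmi(csv) {
  return csv
    .trim()
    .split("\n")
    .filter(Boolean)
    .map((line) => {
      const [index, name, used, total] = line.split(",").map((s) => s.trim());
      const usedMiB = Number(used);
      const totalMiB = Number(total);
      return { index: Number(index), name, usedMiB, totalMiB, freeMiB: totalMiB - usedMiB };
    });
}

// True when the GPU has at least `requiredMiB` of free memory.
function fits(gpu, requiredMiB) {
  return gpu.freeMiB >= requiredMiB;
}

// Made-up sample output for two cards.
const sampleOutput =
  "0, NVIDIA GeForce RTX 3090, 2048, 24576\n" +
  "1, NVIDIA GeForce RTX 3080, 512, 10240\n";
const gpus = parseNvidiaSmi(sampleOutput);
const requiredMiB = 17 * 1024; // 17 GB for Qwen3.6-27B at Q4
console.log(gpus.map((g) => `${g.name}: ${fits(g, requiredMiB)}`));
```

In practice the CSV string would come from running the command, for example with `execFileSync` from `node:child_process`, right before a deploy.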

The Result

A fully automated AI model intelligence blog that:

Try it: ai-updates.pages.dev

Code: github.com/jerfletcher/ai-updates (coming soon — currently private)

GPU setup: coolify-gpu-models repo


Built on homestead GPUs with local inference. No cloud APIs were harmed.

Tags: local-inference, hermes-agent, gpu-homelab, sglang, automation