> [MOOLAH]: The GPU Inference Node — Everything You Need to Know

> [ OVERVIEW ]: WHAT IS MOOLAH?

Moolah is the primary GPU inference node in the A1AI private cluster. Built around the RTX 5070 (8GB VRAM), it handles all 7B–14B model inference, ComfyUI image generation, and serves as the production AI workhorse behind this blog's automated daily newsroom.

> [ SPECS ]: HARDWARE CONFIGURATION

> GPU: RTX 5070 — 8GB GDDR7 VRAM, Blackwell sm_120 architecture
> CPU: AMD Ryzen 7 or Intel i7 (mid-tier, GPU is the bottleneck)
> RAM: 32GB DDR5
> Storage: 1TB NVMe SSD (OS + models)
> OS: Ubuntu 24.04 LTS + WSL2 optional
> Inference: Ollama + LiteLLM proxy

> [ PERFORMANCE ]: BENCHMARK NUMBERS

> Llama 3.1 8B: ~45–55 tokens/sec
> Gemma 3 12B: ~28–35 tokens/sec
> Average inference latency: 0.8s first token
> Cost per 1M tokens: $0.00

> [ SOFTWARE_STACK ]: WHAT RUNS ON MOOLAH

> Ollama (model server)
> LiteLLM (API proxy with model aliases)
> Open WebUI (local ChatGPT interface)
> ComfyUI (image generation via SDXL)
> Tailscale (zero-trust mesh member)

> [ MODELS ]: WHAT FITS IN 8GB VRAM

> Llama 3.1 8B (Q4_K_M) — primary workhorse
> Gemma 3 12B (Q4_K_M) — long context tasks
> Phi-3 Mini — fast lightweight reasoning
> SDXL Base 1.0 — image generation

> [ WANT_THIS ]: GET THE FULL BUILD

The Start Here guide has the exact part list and build sequence. The AI Homelab Blueprint includes the full n8n workflow and cost comparison. If you want a custom build designed for your workload, book an audit.