Local AI Setup: From Zero to Production
Running AI agents locally gives you zero API costs, complete privacy, and offline capability. This guide walks you through the entire setup process — from picking hardware to running your first local agent in production.
Step 1: Hardware Selection
Your GPU determines which models you can run and how fast. Here are realistic tiers for 2026:
For most developers, an RTX 3060 12GB hits the sweet spot: runs 7-8B models comfortably in Q4_K_M quantization with room for context processing. A GTX 1060 6GB is the budget champion — handles 4B models with adapters perfectly for code review, documentation, and audit tasks.
Step 2: Install Ollama
Ollama is the standard tool for running LLMs locally. It handles model downloading, quantization, GPU acceleration, and provides an OpenAI-compatible API.
If Ollama detects your GPU, you will see CUDA being used in the verbose output. If not, ensure your NVIDIA drivers are up to date (version 525+ for CUDA 12 support).
Step 3: Pull and Test Models
Start with a small model to verify your setup, then move to a larger one:
Step 4: Configure OpenClaw for Local Models
OpenClaw connects to Ollama automatically. Add this to your OpenClaw config:
Now OpenClaw will use your local Qwen3-4B model for all agent tasks. No API keys, no internet required.
Step 5: Run Your First Local Agent
With Ollama running and OpenClaw configured, you are ready to run agents locally:
- Browse the FlickClaw agent catalog and select an agent.
- Choose “Ollama” as the export format.
- Drop the generated agent files into your OpenClaw workspace.
- The agent runs against your local model with zero configuration needed.
Performance tip: For local models, use agents with quality gates enabled. Quality gates are deterministic (no model involvement), so they add reliability without adding latency.
Troubleshooting Common Issues
Ensure nvidia-smi works in your terminal. On WSL2, install the NVIDIA CUDA WSL2 package. On Linux, verify nvidia-container-toolkit is installed.
Reduce context size. Most models default to 2048 tokens. For code review, 4096 is practical. For full-repository analysis, 8192+ may be needed — but you will need more VRAM.
Ensure GPU acceleration is active. Check that the model fits entirely in VRAM — if it spills to system RAM, performance drops 10-50x.