Local Development
Running models on your own machine — for prototyping, privacy, offline work, and faster iteration
There's a meaningful productivity boost to running models locally. No API keys, no rate limits, no per-call cost, no network round trip. Modern open-weights models, modest hardware, and good tooling have made this practical for a wide range of tasks. It's not a substitute for frontier APIs in production, but for development it's often a better daily driver than people assume.
Why Run Locally
- Iteration speed. No network latency, no rate limits.
- Privacy. Sensitive code, internal docs, customer data — none of it leaves your machine.
- Offline. Works on planes, in tunnels, in regions with poor connectivity.
- Cost. Free after the upfront hardware investment.
- Determinism. Easier to reproduce runs without provider-side variability.
- Capability check. If a 7B model running locally is good enough for the task, you've learned a lot about the problem.
What's Actually Possible
A modern laptop with 16–32GB of RAM can comfortably run:
- 7–8B parameter models at 4-bit quantization, fast enough for chat.
- 13B models, slower but usable.
- Embedding models with no perceptible latency.
- Whisper for transcription.
- Smaller image generation models.
A workstation with 64GB+ RAM or an Apple Silicon Mac with a unified memory pool can run 70B models at usable speeds. A consumer GPU with 24GB VRAM (RTX 4090, 5090) covers most things up to ~30B.
You won't run frontier-quality models at frontier speeds. You don't need to for most development work.
The Major Tools
Ollama — the most popular local-model runner. CLI to pull and run models, OpenAI-compatible API, broad model library, runs on Mac/Linux/Windows. Wraps llama.cpp under the hood. The path of least resistance for almost anyone starting.
LM Studio — desktop app with a GUI for browsing, downloading, and chatting with models. Also exposes an OpenAI-compatible server. Friendly for non-CLI users; useful even for engineers who just want a quick chat UI.
llama.cpp — the core C++ inference engine that powers most consumer-grade local model running. Direct use is for people who want maximum control or are embedding it into other software.
MLX — Apple's machine learning framework, optimized for Apple Silicon. Strong performance on M-series chips. Tools like mlx-lm provide LLM-specific entry points.
vLLM / SGLang — the production engines from the Infrastructure section. Overkill on a laptop, but the right choice if "local" means a workstation or a small on-prem GPU box for a team.
Jan, GPT4All, Open WebUI — UI layers for local models, often paired with Ollama or LM Studio underneath. Useful for non-technical users in your team or family.
OpenAI-Compatible APIs
Most of these tools speak the OpenAI Chat Completions format. Practical implication: you can swap your dev environment to local by changing one base URL, with no other code changes. This is the single most useful pattern for local development:
client = OpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="ignored"
)The same code that calls GPT-4 or Claude in production calls a local Llama in development. Iteration speed up, costs at zero.
Picking a Model
For local development with text:
- Llama 3.x / 4 series — broad capability, strong baseline.
- Qwen series — strong multilingual and coding performance.
- Mistral / Mixtral — efficient, especially the mixture-of-experts variants.
- Gemma 2 / 3 — Google's open models, well-supported.
- DeepSeek family — strong on coding and reasoning.
- Phi series — small but capable, good for memory-constrained environments.
For embeddings, code, or domain-specific work, the model zoo is wider; HuggingFace's leaderboards are the obvious starting point.
Start with the smallest model that's good enough. Going from 70B to 7B saves an order of magnitude in resources.
Quantization in Practice
Local models are almost always quantized:
- Q4_K_M (4-bit) — the default for most local use. Good quality, half the memory of 8-bit.
- Q5_K_M / Q6_K — slightly higher quality, more memory.
- Q8_0 — near-lossless, twice the memory of Q4.
For development, Q4 is usually fine. For evaluation work where quality matters, step up.
Where Local Stops Being the Right Answer
- Frontier capability required. No 70B local model is GPT-5 or Claude 4.7.
- Long context. Local KV cache eats memory fast; 100K+ context is hard on a laptop.
- Concurrent users. Local is single-user; multi-user serving needs real infrastructure.
- Multimodal at scale. Vision and audio models work locally but are expensive in memory and compute.
A Healthy Pattern
Many teams converge on: local for development, frontier API for production. Same code path, same prompt structures, swap the base URL and model name. The local environment catches simple bugs and lets you iterate fast; the production environment delivers quality. Both are useful; neither replaces the other.