Model Serving
Turning a running inference engine into a reliable production service
A model loaded into vLLM is the kernel of a serving system, not the whole thing. Around it goes everything that makes any production service real: load balancing, autoscaling, observability, deploy hygiene, multi-tenancy. The patterns are familiar from microservices — with a few twists specific to GPU workloads.
The Reference Architecture
A common shape:
[clients]
↓
[gateway / auth / rate limit]
↓
[router] — model and tenant aware
↓
[inference replicas] — vLLM/SGLang behind GPU pools
↓
[shared cache / vector DB / tools]
Every layer earns its place. The gateway handles auth and quotas; the router picks the right replica; the inference layer does the heavy work; shared services keep state out of the inference replicas so they can scale.
Routing
A serving system that runs multiple models, multiple tiers, or multiple regions needs deliberate routing:
- Model routing — direct each request to a replica with the right model loaded.
- Tenant routing — pin tenants to replicas if they share a fine-tuned variant or LoRA.
- Capacity-aware routing — avoid overloaded replicas; queue or shed when capacity is tight.
- Region routing — serve users from the closest GPU pool.
Naive round-robin works for one model on identical hardware. Anything more complex needs a router that knows which models each replica has loaded and how busy each replica is.
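A minimal sketch of how model, capacity, and region routing from the list above might combine in a single selection function. The `Replica` fields, the `max_queue` threshold, and the `pick_replica` name are illustrative assumptions, not part of vLLM, SGLang, or any particular router.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Replica:
    name: str
    models: Set[str]       # models (or LoRA variants) loaded on this replica
    region: str
    queue_depth: int = 0   # requests currently waiting
    max_queue: int = 8     # illustrative shed threshold, tune per deployment


def pick_replica(replicas: List[Replica], model: str, region: str) -> Optional[Replica]:
    """Return the best replica for a request, or None to signal 'queue or shed'."""
    # Model routing: only replicas that already hold the model are eligible.
    eligible = [r for r in replicas if model in r.models]
    # Capacity-aware routing: skip replicas whose queue is already full.
    eligible = [r for r in eligible if r.queue_depth < r.max_queue]
    if not eligible:
        return None
    # Region routing: prefer the caller's region, then the shortest queue.
    return min(eligible, key=lambda r: (r.region != region, r.queue_depth))
```

Returning `None` instead of raising leaves the queue-or-shed decision to the caller, which is usually the gateway. Tenant pinning could be added as one more filter before the capacity check.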
Autoscaling Is Hard
Autoscaling stateless web services is well understood. GPU-backed services are different:
- Cold starts are slow. Loading a 70B model into VRAM takes minutes, not seconds.
- GPUs are expensive when idle. You can't over-provision generously.
- Capacity isn't guaranteed on demand. Spinning up new instances can block on cluster capacity.
- Keeping replicas warm matters more than scaling down aggressively. A replica you remove costs minutes of model loading to bring back.
Pragmatic patterns: scale on queue depth or tokens-per-second utilization rather than CPU; pre-warm replicas ahead of predictable traffic ramps; keep a small reserve of warm capacity.
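As a rough illustration of these patterns, the sketch below computes a target replica count from queue depth, keeps a warm floor, and limits how fast the fleet shrinks. The function name, parameters, and default values are assumptions chosen for the example and would need tuning against real traffic.

```python
import math


def desired_replicas(
    queue_depth: int,
    current: int,
    target_queue_per_replica: int = 4,  # assumed healthy backlog per replica
    min_warm: int = 2,                  # reserve of warm capacity
    max_replicas: int = 16,             # cluster or budget ceiling
    max_scale_down_step: int = 1,       # replicas removed per interval at most
) -> int:
    """Replica count for the next interval, driven by queue depth rather than CPU."""
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    wanted = max(min_warm, min(max_replicas, wanted))
    if wanted < current:
        # Scale down slowly: a removed replica costs minutes of model loading to get back.
        wanted = max(wanted, current - max_scale_down_step)
    return wanted
```

Scale-up is applied immediately while scale-down is stepped, which reflects the asymmetry between slow cold starts and the cost of idle GPUs.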
Multi-Tenancy
Serving many users from shared GPUs needs guardrails:
- Per-tenant rate limits — both request rate and token throughput.
- Fair-share scheduling — one big customer shouldn't starve everyone else.
- Isolation for sensitive workloads — separate replica pools, not shared.
- Quota enforcement — at the gateway, not deep in the inference engine.
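One way to enforce both request-rate and token-throughput limits at the gateway is a pair of token buckets per tenant. The sketch below is a single-process illustration; `TokenBucket`, `TenantLimiter`, and the `burst_s` parameter are hypothetical names, and a real gateway would likely keep this state in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Refills at `rate` units per second, up to `capacity` units."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float]):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.level = capacity
        self.updated = clock()

    def available(self) -> float:
        # Credit the time elapsed since the last check, capped at capacity.
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        return self.level

    def take(self, amount: float) -> None:
        self.level -= amount


class TenantLimiter:
    """Two buckets per tenant: one counts requests, one counts LLM tokens."""

    def __init__(self, req_per_s: float, tok_per_s: float, burst_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.req_per_s = req_per_s
        self.tok_per_s = tok_per_s
        self.burst_s = burst_s
        self.clock = clock
        self._buckets: Dict[str, Tuple[TokenBucket, TokenBucket]] = {}

    def admit(self, tenant: str, estimated_tokens: int) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = (
                TokenBucket(self.req_per_s, self.req_per_s * self.burst_s, self.clock),
                TokenBucket(self.tok_per_s, self.tok_per_s * self.burst_s, self.clock),
            )
        requests, tokens = self._buckets[tenant]
        # Check both budgets before charging either, so a rejection costs nothing.
        if requests.available() < 1 or tokens.available() < estimated_tokens:
            return False
        requests.take(1)
        tokens.take(estimated_tokens)
        return True
```

Because output length is unknown at admission time, `estimated_tokens` is a guess (for example, prompt tokens plus the requested maximum output); a fuller version might reconcile the estimate against actual usage after the response completes.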
High Availability
Production patterns:
- Multi-replica per model — a single replica is a single point of failure.
- Multi-region — for users globally, and for cloud-region outages.
- Multi-provider failover — if you depend on a single API provider, plan for their outages.
- Graceful degradation — when capacity is tight, serve a smaller model rather than fail.
The fallback paths are part of the SLO, not an afterthought.
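Graceful degradation and provider failover can both be expressed as an ordered list of backends tried in turn. This is a minimal sketch: `CapacityError` and `generate_with_fallback` are hypothetical names, and each backend is assumed to be a plain callable that raises `CapacityError` when it cannot take the request.

```python
from typing import Callable, List, Tuple


class CapacityError(Exception):
    """Raised by a backend that cannot take the request right now (hypothetical)."""


def generate_with_fallback(
    prompt: str,
    backends: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try backends in priority order; return (backend_name, output) from the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except CapacityError as exc:
            # Record the failure and fall through to the next, cheaper option.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the output lets the caller log how often the degraded path is used, which is the figure an SLO on fallbacks would track.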
Observability for Serving
On top of standard service metrics (RPS, error rate, latency percentiles), GPU serving needs:
- GPU utilization and memory. A replica at 30% utilization is wasted money; one at 99% memory is about to OOM.
- Batch size and queue depth. The leading indicators of capacity issues.
- Tokens-per-second per replica. The actual unit of useful work.
- KV cache hit rate. The direct measure of caching effectiveness: how often a request reuses cached prefix state instead of recomputing it.
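To make these signals concrete, the sketch below derives tokens per second and cache hit rate from one sampling window and flags the conditions described above. The `ReplicaSample` fields and the alert thresholds are assumptions for illustration; real values would come from your metrics system and your own capacity testing.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ReplicaSample:
    gpu_util: float        # fraction of time the GPU was busy, 0.0 to 1.0
    gpu_mem_used: float    # fraction of VRAM in use, 0.0 to 1.0
    queue_depth: int       # requests waiting at sample time
    tokens_generated: int  # output tokens produced during the window
    window_s: float        # length of the sampling window in seconds
    cache_hits: int        # cache lookups that found reusable state
    cache_lookups: int     # total cache lookups in the window


def assess(s: ReplicaSample, max_queue: int = 8) -> Dict[str, Union[float, List[str]]]:
    """Summarize one replica's window into derived metrics plus alert labels."""
    alerts: List[str] = []
    if s.gpu_util < 0.30:          # illustrative threshold: paying for idle GPU
        alerts.append("underutilized")
    if s.gpu_mem_used > 0.95:      # illustrative threshold: close to OOM
        alerts.append("memory pressure")
    if s.queue_depth > max_queue:  # backlog is growing faster than it drains
        alerts.append("queue backing up")
    return {
        "tokens_per_s": s.tokens_generated / s.window_s,
        "cache_hit_rate": s.cache_hits / s.cache_lookups if s.cache_lookups else 0.0,
        "alerts": alerts,
    }
```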
Deployment
Treat model artifacts like any other releasable artifact:
- Versioned model files. Pinned, immutable, reproducible.
- Canary rollouts. Send 1% of traffic to a new model version and measure.
- A/B testing infrastructure. Compare versions on production traffic.
- One-flag rollback. Always have a way back.
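A canary split and a one-flag rollback can share the same mechanism: hash a stable request key into a bucket and compare it against a configurable percentage. This is a sketch under assumptions; `choose_version` and its parameters are hypothetical, and the key could be a user ID, tenant ID, or session ID depending on what should stay sticky.

```python
import hashlib


def choose_version(request_key: str, canary_percent: float, stable: str, canary: str) -> str:
    """Deterministically send about `canary_percent` percent of keys to the canary."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, 0.01% granularity
    return canary if bucket < canary_percent * 100 else stable
```

Setting `canary_percent` to 0 sends every key to the stable version, which is the rollback; raising it to 100 completes the ramp. Hashing rather than random sampling keeps each key on the same version across requests, so per-user comparisons stay clean.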
Serverless / Managed Options
Not everyone should run their own inference fleet. Managed options:
- Provider APIs (Anthropic, OpenAI, Google, AWS Bedrock) — fastest path, no infrastructure.
- Cloud GPU APIs (AWS SageMaker, GCP Vertex, Azure ML) — managed serving for your own models.
- Inference platforms (Together, Fireworks, Replicate, Modal, Anyscale) — managed serving for open-weights models, often with their own optimizations.
The "build vs buy" math for serving has moved a lot in the buy direction over the last two years. Self-hosting is now mostly justified by scale, customization, or compliance — not by cost.
A Sanity Check
A serving setup is healthy when you can: deploy a new model version with a single command, route 1% of traffic to it, watch dashboards for an hour, ramp to 100%, and roll back instantly if needed. Without that loop, you'll regret every model upgrade.