Inference Engines

A model file is just weights. To actually serve requests, you need an inference engine: software that loads the model, batches requests, manages the KV cache, and exposes an API. The choice matters a lot: throughput on the same hardware can differ by 5–10x between a naive setup and a well-tuned engine.

Why Plain PyTorch Isn't Enough

Running a Hugging Face model with model.generate() works for prototypes and dies in production. It lacks:

Continuous / in-flight batching — adding new requests to a running batch as old ones finish, instead of waiting for a static batch.
PagedAttention — KV cache memory management that allocates the cache in small fixed-size blocks on demand, avoiding fragmentation and up-front over-reservation.
Fused kernels — collapsing multiple operations into single GPU calls.
Speculative decoding — drafting tokens with a small model, verifying with the large one.
Multi-tenant scheduling — fair time-sharing across many concurrent users.

Modern engines build all of this in.
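To make the continuous batching point concrete, here is a minimal sketch that counts decode steps under both policies. It is a toy simulation, not any engine's scheduler: the function names and the output lengths are made up for illustration, and it assumes every running request produces exactly one token per step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step yields one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (tokens): two long requests among short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With static batching the short requests sit idle behind the long one in their batch. With continuous batching the second long request starts as soon as a slot frees up, so the two long requests overlap almost entirely.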

The Major Engines

vLLM — the de facto open-source default. PagedAttention, continuous batching, broad model support, active community.
SGLang — competitive with or ahead of vLLM on many workloads, particularly strong with structured output and complex prompting patterns.
TensorRT-LLM — NVIDIA's engine. Highest throughput on NVIDIA hardware if you tune it; steeper learning curve.
Text Generation Inference (TGI) — Hugging Face's engine. Production-ready, integrates well with the HF ecosystem.
llama.cpp — CPU-and-everything inference. Quantized, portable, the dominant on-device runtime.
MLX (Apple) — Apple Silicon native. Surprising performance on M-series Macs.
Ollama — wraps llama.cpp with a friendly UX. Great for local development.

Picking One

Rough decision tree:

Prototyping / single user — Ollama or llama.cpp.
On-device / mobile / edge — llama.cpp, MLX, or TFLite.
Self-hosting on a cluster — vLLM or SGLang. TensorRT-LLM if you have the engineering bandwidth and want maximum throughput.
Behind an API for customers — anything battle-tested. The engine matters less than the operational maturity around it.

Throughput vs Latency

These trade off against each other. Larger batches raise throughput but lengthen per-request latency, because every request in the batch waits on a slower step. The right operating point depends on the product:

Batch jobs — maximize throughput, latency doesn't matter.
Chat — set a moderate latency budget, then push throughput up until you hit it.
Voice / interactive — low latency wins; throughput is secondary.

Most engines expose batch size, max batched tokens, and prefill/decode scheduling parameters. Tune them to your traffic shape.
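The trade-off can be illustrated with a toy cost model. The sketch below assumes a decode step costs a fixed amount (reading the weights) plus a small per-sequence amount (reading that sequence's KV cache). The constants 20 ms and 0.5 ms are invented for illustration, not measurements, and the function names are hypothetical.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Assumed cost of one decode step: fixed part plus per-sequence part."""
    return fixed_ms + per_seq_ms * batch_size


def throughput_tok_per_s(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Tokens per second across the whole batch (one token per sequence per step)."""
    return batch_size * 1000.0 / decode_step_ms(batch_size, fixed_ms, per_seq_ms)


def largest_batch_within(latency_budget_ms, max_batch=512,
                         fixed_ms=20.0, per_seq_ms=0.5):
    """Largest batch size whose per-token latency stays within the budget."""
    best = 0
    for b in range(1, max_batch + 1):
        if decode_step_ms(b, fixed_ms, per_seq_ms) <= latency_budget_ms:
            best = b
    return best


for b in (1, 8, 64, 256):
    print(f"batch={b:4d}  per-token latency={decode_step_ms(b):6.1f} ms  "
          f"throughput={throughput_tok_per_s(b):8.1f} tok/s")

# A chat product with a 50 ms per-token budget under this model:
print("largest batch within 50 ms:", largest_batch_within(50.0))
```

Under these assumptions throughput keeps rising with batch size while every user sees slower tokens, which is why the operating point is picked by working backwards from a latency budget.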

KV Cache Is the Hidden Bottleneck

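Token-by-token generation (decode) is bottlenecked on memory bandwidth rather than compute, and after the model weights the largest thing in memory is the KV cache (per-token, per-layer attention state). At long contexts or high concurrency it can outgrow the weights. Implications: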

Long contexts are expensive in memory primarily because the KV cache grows linearly with context length (prompt plus generated tokens).
Many concurrent users all need their own KV cache; total VRAM is the constraint.
Prefix caching — sharing KV cache across requests that start with the same prefix. This is a huge win for chat-shaped or RAG workloads. vLLM and SGLang both support it.
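The memory cost is easy to estimate by hand. The sketch below uses the standard sizing for a transformer with per-layer key and value vectors. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 40 GiB of free VRAM are assumptions chosen for illustration, roughly a 7B-class model without grouped-query attention.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes stored for one token of one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_vram_bytes, context_tokens, per_token_bytes):
    """How many full-length sequences fit in the VRAM left after weights."""
    return free_vram_bytes // (context_tokens * per_token_bytes)


GIB = 2**30

# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, fp16 values.
per_token = kv_cache_bytes_per_token(32, 32, 128)
per_sequence = per_token * 4096  # one 4096-token context

print(f"per token:    {per_token / 2**20:.2f} MiB")
print(f"per sequence: {per_sequence / GIB:.2f} GiB")
print("sequences in 40 GiB:", max_concurrent_sequences(40 * GIB, 4096, per_token))

# With grouped-query attention (assumed 8 KV heads) the cache is 4x smaller.
per_token_gqa = kv_cache_bytes_per_token(32, 8, 128)
print("sequences in 40 GiB with 8 KV heads:",
      max_concurrent_sequences(40 * GIB, 4096, per_token_gqa))
```

Under these assumptions each token costs 0.5 MiB and each 4096-token sequence costs 2 GiB, so concurrency is capped by VRAM well before compute runs out. This is also why a shared prefix is valuable: its cache is stored once rather than once per request.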

Speculative Decoding

Use a small "draft" model to predict the next several tokens, then verify them in parallel with the big model. When the draft is right (often the case for predictable text), one forward pass of the big model yields several tokens instead of one. Common gains: 1.5–3x faster generation. Most modern engines support some form of it.
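A minimal sketch of the draft-and-verify loop, restricted to greedy decoding so that correctness is easy to check. The two "models" are toy functions invented for illustration, and the verification loop calls the target once per position where a real engine would score all positions in a single batched forward pass. Sampling-based variants use a rejection-sampling acceptance rule, which is not shown here.

```python
def greedy_decode(target_next, prompt, num_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (new_tokens, target_passes)."""
    tokens = list(prompt)
    produced = 0
    target_passes = 0
    while produced < num_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify. Counted as one target pass, since a real engine
        #    would score all k + 1 positions in one forward pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's correction, stop
                break
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target_next(tokens + draft))
        accepted = accepted[:num_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):], target_passes


# Toy models (hypothetical): the target counts upward mod 10,
# and the draft agrees except right after a 7.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 7 else (tokens[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], num_tokens=12)
assert out == greedy_decode(target_next, [0], 12)
print(out, "in", passes, "target passes instead of 12")
```

The output always matches plain greedy decoding with the target model. The draft model only changes how many target passes are needed, which is why the speedup depends on how often the draft guesses right.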

Multi-Model Serving

Serving many models on the same hardware requires extra machinery:

Hot-swapping — keep a few models in VRAM, page others in/out as traffic shifts.
LoRA serving — many adapters on top of one base model, near-zero switching cost.
Routing — direct each request to the engine instance that has the right model loaded.
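Hot-swapping can be sketched as a least-recently-used cache over loaded models. This is an illustration of the idea only: `ModelCache`, `load_fn`, and the model names are hypothetical, and a real system would also account for model size, load time, and in-flight requests before evicting.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `capacity` models resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: name -> model object
        self.resident = OrderedDict()   # ordered from least to most recently used
        self.loads = 0                  # number of (expensive) loads performed

    def get(self, name):
        if name in self.resident:
            # Cache hit: mark as most recently used.
            self.resident.move_to_end(name)
            return self.resident[name]
        if len(self.resident) >= self.capacity:
            # Page out the least recently used model to make room.
            self.resident.popitem(last=False)
        self.resident[name] = self.load_fn(name)
        self.loads += 1
        return self.resident[name]


cache = ModelCache(capacity=2, load_fn=lambda name: f"<weights:{name}>")
for requested in ["model-a", "model-b", "model-a", "model-c", "model-b"]:
    cache.get(requested)

print("resident:", list(cache.resident))
print("loads:", cache.loads)
```

A router sitting in front of several such instances would prefer the one where the requested model is already resident, so that loads happen only when traffic genuinely shifts.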

If you're running fine-tuned variants for many customers, LoRA serving is usually the right pattern.
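The reason adapter switching is nearly free is visible in the arithmetic: the base weights W are shared, and each adapter adds only a low-rank correction, y = Wx + scale * B(Ax). The sketch below uses plain Python lists and tiny made-up matrices. The adapter names and values are hypothetical, and `scale` stands in for the usual alpha / rank factor.

```python
def matvec(matrix, vector):
    """Matrix-vector product over plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, x, adapter=None):
    """Shared base projection plus an optional per-request low-rank update."""
    y = matvec(W, x)
    if adapter is None:
        return y
    A, B, scale = adapter              # A: rank x d_in, B: d_out x rank
    delta = matvec(B, matvec(A, x))    # never materializes the full B @ A matrix
    return [yi + scale * di for yi, di in zip(y, delta)]


# One base weight matrix, loaded once and shared by every request.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Hypothetical rank-1 adapters, one per customer.
adapters = {
    "customer_a": ([[1.0, 1.0]], [[1.0], [0.0]], 1.0),
    "customer_b": ([[1.0, -1.0]], [[0.0], [2.0]], 0.5),
}

x = [3.0, 1.0]
print("base:      ", lora_forward(W, x))
print("customer_a:", lora_forward(W, x, adapters["customer_a"]))
print("customer_b:", lora_forward(W, x, adapters["customer_b"]))
```

Switching customers means picking a different small (A, B) pair per request, so requests for different adapters can share a batch over the same base weights.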

What This All Buys You

A well-tuned vLLM deployment can deliver 5–10x the throughput of a naive PyTorch setup on the same hardware. That's the difference between profitable self-hosted inference and an expensive disappointment.
