Quantization

Quantization is the process of representing model weights (and sometimes activations) in fewer bits than the original precision. It's the single most important technique for running large models on smaller hardware — and for serving them cheaply at scale.

Why It Works

Modern foundation models are massively over-parameterized for inference. Their weights are stored in formats with far more range and precision than inference needs, and most individual weights need only a few bits to capture the information that matters. Drop from 16-bit to 8-bit and the model is essentially unchanged on most tasks. Drop to 4-bit with the right method and the gap is small enough to be worth the savings on most workloads.
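To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest quantization, the simplest scheme and not one of the named methods later on this page. The function names, the synthetic Gaussian weights, and the single shared scale are all assumptions made for illustration; real toolchains typically use per-channel or per-group scales.

```python
import random

def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp defensively so every code stays inside the signed range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic stand-in for a weight tensor (assumption: small, roughly Gaussian values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    err = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit: scale={scale:.6f}, mean abs error={err:.6f}")
```

The reconstruction error grows as the bit width shrinks, which is why 4-bit schemes in practice rely on finer-grained scales and error compensation rather than plain rounding.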

The Common Precisions

FP32 — full precision. Used for training; rarely for inference.
FP16 / BF16 — half precision. The default for inference on GPUs that support it. Negligible quality loss.
FP8 — 8-bit floating point. NVIDIA Hopper (H100) and Blackwell GPUs support it natively. Roughly half the memory of FP16, with similar quality on calibrated models.
INT8 — 8-bit integer. Common for inference; tooling is mature.
INT4 — 4-bit. Aggressive, but with the right scheme (GPTQ, AWQ, GGUF Q4_K_M) the quality hit is small for many workloads.
INT2 / 1-bit — research-frontier; usable for some specific scenarios, generally too lossy for production.

Each halving of bit width (32 → 16 → 8 → 4) roughly halves the memory and bandwidth needed for the weights.
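The arithmetic behind that rule of thumb is simple enough to sketch. The helper below is hypothetical and counts weight storage only: it ignores quantization scales and zero-points, the KV cache, and activation memory, all of which add to the real footprint.

```python
# Bits per stored weight for the precisions listed above.
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

num_params = 70e9  # a 70B-parameter model, used here only as an example size
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>10}: {weight_memory_gb(num_params, bits):6.1f} GB")
```

For a 70B-parameter model this works out to about 140 GB at FP16, 70 GB at 8 bits, and 35 GB at 4 bits, before any overhead.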

Weight-Only vs Weight-and-Activation

Weight-only quantization — weights are stored in low precision, dequantized on the fly during compute. Memory and bandwidth win, modest compute cost. The most common form.
Activation quantization — activations are also kept in low precision through compute. Bigger speedups, more careful calibration required to preserve quality.

Most production deployments do weight-only INT8 or INT4 with FP16 or BF16 activations.
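The difference between the two modes can be sketched with a single dot product. This is a toy illustration in pure Python, with hypothetical helper names and a single per-tensor scale; real kernels fuse these steps and run on packed integer data.

```python
def quantize(values, bits=8):
    """Symmetric round-to-nearest quantization; returns integer codes and a scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dot_weight_only(x, w_codes, w_scale):
    # Weights are stored as integers and dequantized on the fly;
    # the multiply-accumulate itself stays in floating point.
    return sum(xi * (wc * w_scale) for xi, wc in zip(x, w_codes))

def dot_weight_and_activation(x, w_codes, w_scale):
    # Activations are quantized too, so the inner product is integer-only;
    # one floating-point rescale is applied at the end.
    x_codes, x_scale = quantize(x)
    acc = sum(xc * wc for xc, wc in zip(x_codes, w_codes))  # integer accumulator
    return acc * x_scale * w_scale

x = [0.5, -1.25, 2.0, 0.125]     # example activations
w = [0.02, -0.01, 0.03, 0.015]   # example weights
w_codes, w_scale = quantize(w)

exact = sum(xi * wi for xi, wi in zip(x, w))
print("exact:                ", exact)
print("weight-only:          ", dot_weight_only(x, w_codes, w_scale))
print("weight-and-activation:", dot_weight_and_activation(x, w_codes, w_scale))
```

Both results land close to the exact value, but the second path adds activation rounding error on top of weight rounding error, which is why it needs more careful calibration.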

Methods to Know

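GPTQ — post-training quantization that works layer by layer and adjusts the not-yet-quantized weights to compensate for the error each rounding step introduces. Strong quality at INT4.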
AWQ — activation-aware quantization. Preserves the weights that matter most based on activation statistics.
GGUF (formerly GGML) — quantization format used by llama.cpp. Multiple variants (Q4_K_M, Q5_K_M, Q8_0, etc.) trading quality and size.
SmoothQuant — rescales activations and weights per channel so that activation outliers are shifted into the weights, which makes activation quantization easier.
FP8 calibration — for engines and hardware that natively support FP8, calibrate scales for stable activation quantization.
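As one example from the list above, the per-channel rescaling behind SmoothQuant can be sketched in a few lines. This is an illustration of the scaling identity only, not the library's implementation: the tiny matrices are made up, alpha=0.5 is an illustrative choice, and all channel maxima are assumed to be non-zero. In a real pipeline the activation maxima come from a calibration set and the division is typically folded into the preceding layer rather than done at runtime.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (n x k) @ (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, weight_absmax)]

# Toy calibration activations (rows = tokens); channel 1 carries large outliers.
X = [[0.1, 40.0, -0.2],
     [0.3, -35.0, 0.1]]
# Toy weights (rows = input channels, columns = output features).
W = [[0.5, -0.4],
     [0.01, 0.02],
     [-0.3, 0.6]]

channels = range(len(W))
act_absmax = [max(abs(row[j]) for row in X) for j in channels]
weight_absmax = [max(abs(v) for v in W[j]) for j in channels]
s = smoothing_scales(act_absmax, weight_absmax)

# Divide activations by s and multiply weights by s: the product is unchanged.
X_smooth = [[row[j] / s[j] for j in channels] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in channels]

smooth_absmax = [max(abs(row[j]) for row in X_smooth) for j in channels]
print("activation range ratio before:", max(act_absmax) / min(act_absmax))
print("activation range ratio after: ", max(smooth_absmax) / min(smooth_absmax))
print("output unchanged:", matmul(X, W), "vs", matmul(X_smooth, W_smooth))
```

After smoothing, the activation channels have much more similar ranges, so a single activation scale wastes far fewer quantization levels on the outlier channel.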

Quality Impact

Roughly:

FP16 → FP8 / INT8 — typically less than 1% quality drop on standard benchmarks. Often imperceptible.
INT8 → INT4 — 1–5% drop with good methods, more with naive ones. Task-dependent: chat-style generation is more forgiving than code or math.
Below INT4 — quality starts to fall noticeably. Use only when memory pressure forces it.

Always evaluate on your own task, not just public benchmarks. A model that's "fine" at INT4 on MMLU might fail your specific use case.
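A task-specific check does not need to be elaborate. The sketch below compares a baseline and a quantized model on your own labeled examples; the `predict` callables, the example data, and the exact-match metric are all hypothetical stand-ins for whatever your inference stack and task actually use.

```python
def accuracy(predict, examples):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    correct = sum(1 for prompt, expected in examples if predict(prompt) == expected)
    return correct / len(examples)

def relative_drop(baseline_acc, quantized_acc):
    """Quality lost relative to the baseline, as a fraction."""
    return (baseline_acc - quantized_acc) / baseline_acc if baseline_acc else 0.0

# Stand-in evaluation set drawn from your own task (made-up data).
examples = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]

# Stand-ins for real model calls: canned answers keyed by prompt.
baseline_answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4"}
quantized_answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "5"}

base_acc = accuracy(baseline_answers.get, examples)
quant_acc = accuracy(quantized_answers.get, examples)
print(f"baseline={base_acc:.2f} quantized={quant_acc:.2f} "
      f"relative drop={relative_drop(base_acc, quant_acc):.0%}")
```

Running the same harness against every quantized variant you are considering gives a like-for-like comparison on the workload that actually matters to you.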

When to Quantize

Self-hosting on smaller GPUs — INT4 shrinks a 70B model's weights from roughly 140 GB at FP16 to roughly 35 GB, so it can fit on a single GPU it otherwise wouldn't.
On-device / edge — almost always quantized; bandwidth and memory are king.
Cost-driven serving — quantized models serve more concurrent users per GPU.
Long-context inference — KV cache quantization (separate technique) compounds the savings for long-context workloads.
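The long-context point can be made concrete with the standard KV cache size estimate: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The model dimensions below are illustrative assumptions, not the specification of any particular model, and the estimate ignores per-block scale overhead in quantized caches.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits):
    """Approximate KV cache size; the factor of 2 covers keys and values."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8

# Hypothetical configuration for a large model with grouped-query attention.
config = dict(num_layers=80, num_kv_heads=8, head_dim=128,
              seq_len=32768, batch_size=1)

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    gib = kv_cache_bytes(bits=bits, **config) / 2 ** 30
    print(f"{label} KV cache: {gib:.2f} GiB per sequence")
```

Under these assumptions a single 32K-token sequence needs 10 GiB of cache at 16 bits, and the figure scales linearly with both sequence length and batch size, so halving the cache precision frees memory for more concurrent requests.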

When Not To

Training — quantization is for inference; mixed-precision training uses different techniques.
Quality-critical evaluation — don't quantize the model you're using as ground truth.
Already-fast workloads — if you're not hardware-constrained, the quality risk isn't worth the savings.

The Operational View

Quantization isn't free. You introduce a calibration step, a quality-vs-size tradeoff to manage, and a new artifact to version. It's worth it for serious self-hosted deployments and not worth it for prototypes. Most production systems end up at INT8 or INT4 weight-only quantization as the default.
