Distillation
Capture the behavior of a big expensive model in a small cheap one
Distillation is the practice of training a smaller "student" model to mimic the behavior of a larger "teacher" model. Done well, it produces a model that is a fraction of the size, runs at a fraction of the cost, and approaches the teacher's quality on the target tasks. It's one of the most pragmatic levers in production AI.
Why Distill
- Cost. A distilled 8B model serving a narrow task often matches frontier-API quality at 1–10% of the cost.
- Latency. Smaller models are faster, sometimes by a lot. Critical for interactive UX.
- Privacy. With a self-hosted distilled model, user data never leaves your infrastructure.
- Reliability. No external dependency, no rate limits, no provider deprecations.
- Customization. The distilled model behaves how you trained it, not how the API decides this quarter.
The tradeoff is upfront work and ongoing maintenance. Don't distill until you've validated that the API path is the bottleneck.
What Counts as Distillation
The term covers a range of techniques:
- Sequence-level distillation. Generate a dataset of (input, teacher output) pairs; fine-tune a smaller model on it. Simple, common, often called "synthetic data fine-tuning" rather than distillation.
- Logit distillation. Train the student to match the teacher's full output distribution, not just the chosen tokens. Stronger signal; requires access to the teacher's logits.
- Feature distillation. Match intermediate representations. Mostly used in research.
- Rationale distillation. The teacher generates step-by-step reasoning; the student is trained to produce the same rationale. Useful for distilling reasoning capability.
For most production use, sequence-level distillation is the default — it's straightforward and works.
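To make the difference between logit distillation and sequence-level distillation concrete, here is a minimal sketch of a logit-matching loss for a single token position, in plain Python. It assumes you already have raw logits from both models over the same vocabulary. The function names, the temperature value, and the T² rescaling (a common convention for keeping gradient magnitudes comparable across temperatures) are illustrative choices, not something this article prescribes.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from raw logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one token position at the given temperature."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Assumed convention: rescale by T^2 so the loss scale does not
    # shrink as the temperature grows.
    return kl * temperature ** 2
```

The loss is zero when the student's distribution matches the teacher's and grows as the two diverge. Sequence-level distillation, by contrast, uses ordinary cross-entropy against the tokens the teacher actually emitted, so it needs only the teacher's text output.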
When Distillation Actually Helps
- Narrow, repeatable tasks. Customer support classification, structured extraction, intent detection, code completion in a specific style.
- Tasks where the teacher is consistently right. The student learns the teacher's mistakes along with its successes: if the teacher is wrong 30% of the time, the student is trained to reproduce those errors.
- High volume. The cost savings justify the engineering investment.
- Latency-sensitive paths. Interactive features where every saved millisecond matters.
Where It Backfires
- Open-ended tasks. General conversational quality is hard to distill; the student often loses subtle capabilities.
- Multi-step reasoning. Frontier-level reasoning rarely survives distillation cleanly.
- Long-tail capability. The student is good at the distribution it saw in training; on novel inputs, quality falls off a cliff.
- Low-volume features. The cost of distillation isn't recouped if usage stays small.
The Distillation Pipeline
A practical recipe:
- Define the task narrowly. "Extract these five fields from invoices in JSON." Not "be a helpful assistant."
- Collect or generate inputs. Real production inputs are best; synthetic inputs work as a supplement.
- Run the teacher. Generate teacher outputs across all inputs.
- Filter ruthlessly. Drop teacher outputs that are wrong, malformed, or low-confidence.
- Pick a base model. Open-weights model in the right size class (1B–14B is common).
- Fine-tune. LoRA is usually sufficient; full fine-tuning if the gap to the teacher is large.
- Evaluate. Compare against the teacher and against the previous baseline.
- Iterate. Identify failure modes, generate or collect more data for them, retrain.
Each loop costs hours to days; budget for several before reaching production quality.
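The "run the teacher" and "filter" steps can be sketched as follows for the invoice-extraction example. This is a minimal illustration under stated assumptions: `run_teacher` is a hypothetical callable that wraps whatever teacher API you use and returns a string, and the five field names are made up for the example. The check only catches malformed outputs; catching wrong or low-confidence ones needs task-specific validation or human review.

```python
import json

# Hypothetical schema for the invoice example; substitute your own fields.
REQUIRED_FIELDS = {"vendor", "invoice_number", "date", "total", "currency"}

def is_usable(teacher_output: str) -> bool:
    """Keep only outputs that parse as a JSON object with exactly the expected fields."""
    try:
        parsed = json.loads(teacher_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_FIELDS

def build_training_set(inputs, run_teacher):
    """Pair each input with its teacher output, dropping malformed outputs.

    run_teacher is an assumed callable: input text -> teacher output string.
    Returns the kept (prompt, completion) records and the number dropped.
    """
    kept, dropped = [], 0
    for text in inputs:
        output = run_teacher(text)
        if is_usable(output):
            kept.append({"prompt": text, "completion": output})
        else:
            dropped += 1
    return kept, dropped
```

Tracking the drop count is a deliberate choice: a high drop rate is an early signal that the task definition or the teacher prompt needs work before any fine-tuning starts.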
Quality vs. Capability Coverage
A distilled model is not a smaller version of the teacher with all capabilities intact. It is good at what you trained it on and possibly worse than the original base model on everything else. Two implications:
- Catastrophic forgetting. Aggressive fine-tuning can erase general capabilities. Mix some general data into the training set if those capabilities matter.
- Out-of-distribution behavior. When the student sees an input pattern unlike anything in training, the failure can be worse than the base model would have been. Detect and route these to the API path.
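One way to mix general data into the training set, as suggested for catastrophic forgetting, is sketched below. The function name and the 10% default are assumptions for illustration only; the right fraction depends on the task and should be tuned against your own evaluations.

```python
import random

def mix_training_data(task_examples, general_examples, general_fraction=0.1, seed=0):
    """Return a shuffled training set in which roughly general_fraction
    of the examples are general-purpose rather than task-specific."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    # Solve g / (t + g) = general_fraction for g, the number of general examples.
    wanted = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(wanted, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed
```

If the general pool is smaller than the requested share, the sketch uses all of it rather than duplicating examples.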
Cost Math
Rough numbers for a typical distillation:
- Generation cost. Running a frontier API across enough inputs to fine-tune (~10K–100K examples): $1K–$50K.
- Training cost. A few hundred to a few thousand dollars on rented GPUs for LoRA on an 8B model.
- Engineering time. A few engineer-weeks for a serious effort.
- Ongoing maintenance. Re-run the pipeline when the teacher improves or the task drifts.
If your monthly API spend on this task is below $5K, distillation almost certainly isn't worth it. Above $50K, it almost certainly is. In between, it depends on how narrow the task is and how much engineering time you can spare.
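A back-of-the-envelope payback calculation makes these thresholds easier to reason about. The sketch below is a simplification: it assumes all traffic for the task moves to the distilled model and that costs stay flat over time. The function and parameter names are hypothetical.

```python
def payback_months(monthly_api_spend, one_time_cost,
                   monthly_serving_cost, monthly_maintenance_cost=0.0):
    """Months until cumulative savings cover the one-time distillation cost.

    Returns None when self-hosting does not save money each month.
    """
    monthly_savings = monthly_api_spend - monthly_serving_cost - monthly_maintenance_cost
    if monthly_savings <= 0:
        return None
    return one_time_cost / monthly_savings
```

With illustrative numbers, say a $60K one-time cost (generation, training, and engineering time combined) and serving at 10% of the API bill: a $50K/month task pays back in about 1.3 months (`payback_months(50_000, 60_000, 5_000)`), while a $5K/month task takes over 13 months (`payback_months(5_000, 60_000, 500)`).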
Routing as a Hybrid
You rarely have to choose just one. A common production pattern:
- The distilled model handles 90%+ of inputs cheaply and quickly.
- The frontier API model handles the remaining harder inputs (under 10%), identified by either a classifier or low-confidence detection.
- Failures from the distilled model accumulate into a queue for the next training round.
This combines the cost benefits of distillation with the capability ceiling of the frontier.
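The routing pattern above can be sketched with a simple confidence check. This example assumes the student's serving stack exposes per-token log-probabilities and uses their mean as the confidence signal. `run_student`, `run_frontier`, and the threshold value are all hypothetical placeholders, and the threshold would need calibrating on held-out data. Queuing every low-confidence prompt is a proxy for "failures"; confidently wrong answers still need a separate detection path.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Router:
    # Assumed interface: prompt -> (text, per-token log-probabilities)
    run_student: Callable[[str], Tuple[str, List[float]]]
    # Assumed interface: prompt -> text from the frontier API
    run_frontier: Callable[[str], str]
    min_mean_logprob: float = -0.5  # illustrative threshold, not a recommendation
    retrain_queue: List[str] = field(default_factory=list)

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (text, source), where source is "student" or "frontier"."""
        text, token_logprobs = self.run_student(prompt)
        confident = bool(token_logprobs) and (
            sum(token_logprobs) / len(token_logprobs) >= self.min_mean_logprob
        )
        if confident:
            return text, "student"
        # Low confidence: remember the prompt for the next training round,
        # then fall back to the frontier model.
        self.retrain_queue.append(prompt)
        return self.run_frontier(prompt), "frontier"
```

Returning the source alongside the text makes it straightforward to monitor what share of traffic the student actually handles.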
What to Watch
The distillation landscape moves with the open-weights frontier. Llama, Qwen, DeepSeek, Mistral, Gemma each release base models that are progressively better starting points. A distillation that wasn't viable a year ago often is now. Re-evaluate periodically.