Model Distillation Explained

Article summary

Distillation trains a small student model to mimic a large teacher: the teacher generates the training targets, and the student learns to reproduce them on a narrow task. The payoff is a cheaper, faster model that gets close to the teacher where it matters. It is different from quantization, which shrinks the same model by lowering weight precision rather than training a new smaller one. Distillation is worth the effort when a small model used directly is not good enough but the teacher is too slow or expensive to run at scale.

Distillation is how you turn a model that is too expensive to run into one you can actually afford. The setup is a teacher and a student. The teacher is a large, capable model; the student is a smaller one. The teacher produces training targets, and the student learns to reproduce them. When it works, you end up with a small model that behaves like the big one on the task you care about, at a fraction of the cost and latency.

Teacher to student

The mechanics are an imitation loop. You run a set of inputs through the teacher and collect its outputs. Then you train the student on those input/output pairs, so it learns to produce what the teacher produced. The student is not learning from the ground truth directly; it is learning from the teacher's behavior, which is often richer than a bare label because it captures how the teacher handles ambiguous inputs and edge cases.
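The data-collection half of that loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `build_distillation_pairs` and `toy_teacher` are hypothetical names, and the toy teacher is a keyword rule standing in for a call to a real large model.

```python
from typing import Callable, Iterable


def build_distillation_pairs(
    inputs: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict[str, str]]:
    """Run every input through the teacher and record input/target pairs."""
    return [{"input": text, "target": teacher(text)} for text in inputs]


def toy_teacher(text: str) -> str:
    # Hypothetical stand-in: a real teacher would be a large model, not a rule.
    return "positive" if "good" in text.lower() else "negative"


pairs = build_distillation_pairs(["Good value", "Broke in a week"], toy_teacher)
for pair in pairs:
    print(pair)
```

The resulting list has the same shape as an ordinary supervised fine-tuning set, so the student can be trained on it with whatever training setup you already use.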

In practice this looks a lot like building a fine-tuning dataset, except the teacher writes the answers instead of a human. That connection is worth leaning on: the discipline described in building-a-fine-tuning-dataset applies here too. The teacher's outputs are your data, so the quality of the inputs you feed it and the care you take filtering its answers decide how good the student gets. A teacher will occasionally produce a wrong or sloppy answer; if you train on those uncritically, the student inherits them.
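One way to apply that filtering, sketched under the assumption of a simple classification task with a fixed label set: drop exact duplicates and any teacher answer that fails a validity check. The names `filter_pairs`, `label_is_valid`, and `ALLOWED_LABELS` are illustrative, and the sample data is made up; a real pipeline would use checks suited to its own task.

```python
from typing import Callable

Pair = dict[str, str]

# Assumption for this sketch: the task is classification with two labels.
ALLOWED_LABELS = {"positive", "negative"}


def label_is_valid(pair: Pair) -> bool:
    """Accept only teacher answers that are one of the expected labels."""
    return pair["target"].strip().lower() in ALLOWED_LABELS


def filter_pairs(pairs: list[Pair], is_acceptable: Callable[[Pair], bool]) -> list[Pair]:
    """Drop exact duplicates and any pair that fails the acceptance check."""
    seen: set[tuple[str, str]] = set()
    kept: list[Pair] = []
    for pair in pairs:
        key = (pair["input"], pair["target"])
        if key in seen:
            continue  # duplicates add weight to one example without adding information
        seen.add(key)
        if is_acceptable(pair):
            kept.append(pair)
    return kept


raw = [
    {"input": "Good value", "target": "positive"},
    {"input": "Good value", "target": "positive"},         # exact duplicate
    {"input": "Arrived late", "target": "I'm not sure."},  # off-format answer
    {"input": "Broke in a week", "target": "negative"},
]
clean = filter_pairs(raw, label_is_valid)
print(f"kept {len(clean)} of {len(raw)} pairs")
```

A format check like this catches sloppy answers but not confidently wrong ones; those usually need spot checks against a small human-labeled sample.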

What you get and what you give up

The win is a model that is cheaper to serve, faster to respond, and small enough to run in places the teacher never could, while staying close to the teacher on the narrow task it was distilled for. The cost is generality. The student is specialized; it learned to imitate the teacher on your task distribution, not to reproduce the teacher's full range. Ask it something outside that distribution and it will not have the teacher's depth, because that depth is exactly what you traded away for size and speed.

Distillation is not quantization

These two get confused because both produce a cheaper model, but they are different operations.

Quantization keeps the same model and stores its weights at lower precision (for example 16-bit floats down to 4-bit). The architecture and the number of parameters do not change; you are compressing the existing weights, usually at a small quality cost that grows as precision drops. See quantization-explained-for-llms for the details.
Distillation produces a genuinely different, smaller model that was trained to imitate a larger one. The student has its own, smaller architecture and its own weights.

They compose. A common deployment path is to distill a large teacher into a smaller student, then quantize that student to squeeze it onto the target hardware. One step changes which model you have; the other changes how that model is stored.
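To make the "same model, stored differently" point concrete, here is a toy sketch of symmetric uniform quantization on a handful of made-up weights. It is an assumption-laden illustration: production quantizers work per tensor or per group and are considerably more careful, and `quantize` and `dequantize` are hypothetical helper names.

```python
def quantize(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integers of the given width, plus one shared scale."""
    levels = 2 ** (bits - 1) - 1  # largest magnitude used, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.70, -0.32, 0.10, 0.00]  # made-up values for illustration
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)

print("parameter count before and after:", len(weights), len(q))
print("integer codes:", q)
print("largest rounding error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

The parameter count is identical before and after; only the stored representation and a small rounding error change. Distillation, by contrast, would leave you with a different and shorter list of weights altogether.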

When it is worth it

The baseline to beat is always the same: take the small model you would have distilled into and just use it directly on your task. Benchmark that first. Distillation earns its keep only when two things are true at once: the small model used directly is not good enough, and the teacher is too slow or too expensive to run at your scale. When both hold, distillation can give you most of the teacher's quality at the student's price, and that is a real win.

When they do not both hold, skip it. If a small off-the-shelf model already handles the task, you are doing extra work for nothing. If you can afford to run the teacher directly, run the teacher. Distillation is a specialization move for the specific case where a good small model does not exist yet for your task and a good large one is out of budget.
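The decision rule above is simple enough to write down as a function. The sketch below assumes you have already measured each model on your own evaluation set and can estimate serving cost; `Candidate`, `recommend`, the thresholds, and every number are illustrative placeholders, and cost stands in for latency as well.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float      # score in [0, 1] on your own evaluation set
    cost_per_1k: float  # serving cost per 1,000 requests, in any currency


def recommend(small: Candidate, teacher: Candidate,
              min_quality: float, max_cost_per_1k: float) -> str:
    """Apply the 'benchmark the small model first' rule from the text."""
    if small.quality >= min_quality:
        return "use the small model directly"
    if teacher.quality < min_quality:
        return "neither model is good enough; distillation will not help"
    if teacher.cost_per_1k <= max_cost_per_1k:
        return "run the teacher directly"
    return "distill"


# Illustrative numbers only, not measurements.
small = Candidate("small-model", quality=0.78, cost_per_1k=1.0)
teacher = Candidate("teacher-model", quality=0.93, cost_per_1k=40.0)
decision = recommend(small, teacher, min_quality=0.85, max_cost_per_1k=5.0)
print(decision)
```

The order of the checks mirrors the argument: the small model gets the first chance, and distillation is the last resort rather than the default.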

Frequently asked questions

How is distillation different from quantization?

They both make a model cheaper to run, but in different ways. Quantization takes one model and stores its weights at lower precision, so it is the same model, just smaller and faster, usually at a small quality cost. Distillation produces a genuinely different, smaller model, either trained from scratch or fine-tuned to imitate a larger teacher. You can even do both: distill to a smaller architecture, then quantize that student for deployment.

Why not just use the small model directly?

Sometimes you should, and that is the honest baseline to beat. Distillation is only worth the trouble when the small model used directly is not good enough on your task but a large model is, and you cannot afford to run the large one at scale. If a small model already handles the task well off the shelf, distillation is extra work for no gain. Always benchmark the small model directly first.

Does the student ever beat the teacher?

On the narrow task it was distilled for, a student can match the teacher closely and occasionally edge it on consistency, because it is specialized rather than general. It will not beat the teacher across the broad range of things the teacher can do; that breadth is exactly what you gave up to get a cheaper, faster model. Distillation trades generality for efficiency on a specific job.