Distillation is how you turn a model that is too expensive to run into one you can actually afford. The setup is a teacher and a student. The teacher is a large, capable model; the student is a smaller one. The teacher produces training targets, and the student learns to reproduce them. When it works, you end up with a small model that behaves like the big one on the task you care about, at a fraction of the cost and latency.

Teacher to student

The mechanics are an imitation loop. You run a set of inputs through the teacher and collect its outputs. Then you train the student on those input/output pairs, so it learns to produce what the teacher produced. The student is not learning from the ground truth directly; it is learning from the teacher's behavior, which is often richer than a bare label because it captures how the teacher handles ambiguous inputs and edge cases.
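The collection half of that loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a definitive implementation: `toy_teacher` is a hypothetical stand-in for whatever call reaches your real teacher model, and `collect_teacher_pairs` is an illustrative name rather than an existing library function.

```python
from typing import Callable, Dict, Iterable, List


def collect_teacher_pairs(
    inputs: Iterable[str], teacher: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Run each input through the teacher and record the input/output pair."""
    return [{"input": text, "target": teacher(text)} for text in inputs]


def toy_teacher(text: str) -> str:
    """Hypothetical stand-in for a call to a large teacher model."""
    return "positive" if "good" in text.lower() else "negative"


# The resulting pairs are the student's training set.
pairs = collect_teacher_pairs(["Good value", "Broke in a week"], toy_teacher)
for pair in pairs:
    print(pair)
```

The student is then trained on `pairs` with whatever fine-tuning setup you already use; nothing in the pairs records that a model rather than a person wrote the targets.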

In practice this looks a lot like building a fine-tuning dataset, except the teacher writes the answers instead of a human. That connection is worth leaning on: the same discipline from building-a-fine-tuning-dataset applies. The teacher's outputs are your data, so the quality of the inputs you feed it and the care you take filtering its answers decide how good the student gets. A teacher will occasionally produce a wrong or sloppy answer; if you train on those uncritically, the student inherits them.
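As a rough illustration of that filtering step, the sketch below drops teacher outputs that are empty, off-format, or attached to a duplicate input. The rules and the names (`filter_pairs`, `allowed_targets`) are assumptions chosen for a simple classification task, not a standard recipe; a real pipeline would pick checks that match its own output format.

```python
from typing import Dict, List, Set


def filter_pairs(
    pairs: List[Dict[str, str]], allowed_targets: Set[str]
) -> List[Dict[str, str]]:
    """Keep only well-formed, non-duplicate teacher outputs."""
    seen_inputs = set()
    kept = []
    for pair in pairs:
        target = pair["target"].strip()
        key = pair["input"].strip().lower()
        # Drop empty answers and answers outside the expected label set.
        if not target or target not in allowed_targets:
            continue
        # Drop repeated inputs so no example is over-weighted.
        if key in seen_inputs:
            continue
        seen_inputs.add(key)
        kept.append({"input": pair["input"], "target": target})
    return kept


# Hypothetical raw teacher outputs, including the usual kinds of mess.
raw = [
    {"input": "Good value", "target": "positive"},
    {"input": "good value ", "target": "positive"},  # duplicate input
    {"input": "Arrived late", "target": ""},  # empty answer
    {"input": "Works fine", "target": "I think it is probably positive"},  # off-format
    {"input": "Broke in a week", "target": " negative "},
]
clean = filter_pairs(raw, {"positive", "negative"})
print(clean)
```

Format checks like these catch sloppy answers but not confidently wrong ones. Catching those takes a separate check, such as spot review of a sample.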

What you get and what you give up

The win is a model that is cheaper to serve, faster to respond, and small enough to run in places the teacher never could, while staying close to the teacher on the narrow task it was distilled for. The cost is generality. The student is specialized; it learned to imitate the teacher on your task distribution, not to reproduce the teacher's full range. Ask it something outside that distribution and it will not have the teacher's depth, because that depth is exactly what you traded away for size and speed.
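One way to see this trade-off is to measure how often the student matches the teacher on inputs from the task distribution versus inputs from outside it. The sketch below uses two toy keyword classifiers as hypothetical stand-ins for the models; the vocabularies and example inputs are invented for illustration and are not results from any real system.

```python
from typing import Callable, Iterable, Set


def make_classifier(positive_words: Set[str]) -> Callable[[str], str]:
    """Build a toy sentiment classifier from a set of known positive words."""
    def classify(text: str) -> str:
        words = set(text.lower().split())
        return "positive" if words & positive_words else "negative"
    return classify


def agreement_rate(
    inputs: Iterable[str],
    teacher: Callable[[str], str],
    student: Callable[[str], str],
) -> float:
    """Fraction of inputs on which the student gives the teacher's answer."""
    items = list(inputs)
    if not items:
        raise ValueError("need at least one input")
    matches = sum(teacher(x) == student(x) for x in items)
    return matches / len(items)


# Hypothetical models: the student only knows part of the teacher's vocabulary.
teacher = make_classifier({"good", "great", "excellent", "superb"})
student = make_classifier({"good", "great"})

in_distribution = ["good phone", "great battery", "slow and buggy"]
out_of_distribution = ["superb screen", "excellent camera", "awful support", "good price"]

in_rate = agreement_rate(in_distribution, teacher, student)
out_rate = agreement_rate(out_of_distribution, teacher, student)
print(f"in-distribution agreement: {in_rate:.2f}")
print(f"out-of-distribution agreement: {out_rate:.2f}")
```

Tracking both numbers separately makes the loss of generality visible instead of leaving it to be discovered in production.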

Distillation is not quantization

These two get confused because both produce a cheaper model, but they are different operations.

  • Quantization keeps the same model and stores its weights at lower precision (for example, going from 16-bit floats down to 4-bit values). The architecture and the number of parameters do not change; you are compressing the existing weights, usually with a small quality cost. See quantization-explained-for-llms for the details.
  • Distillation produces a genuinely different, smaller model that was trained to imitate a larger one. The student has its own, smaller architecture and its own weights.

They compose. A common deployment path is to distill a large teacher into a smaller student, then quantize that student to squeeze it onto the target hardware. One step changes which model you have; the other changes how that model is stored.
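A back-of-the-envelope calculation shows how the two steps stack. The parameter counts below are hypothetical, and the estimate covers weight storage only: it ignores activations, any inference-time cache, and the small overhead that quantization formats add for scales and similar metadata.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


# Hypothetical sizes chosen for illustration.
TEACHER_PARAMS = 70_000_000_000
STUDENT_PARAMS = 7_000_000_000

teacher_16bit = weight_bytes(TEACHER_PARAMS, 16)  # original teacher
teacher_4bit = weight_bytes(TEACHER_PARAMS, 4)    # quantization only
student_16bit = weight_bytes(STUDENT_PARAMS, 16)  # distillation only
student_4bit = weight_bytes(STUDENT_PARAMS, 4)    # distill, then quantize

for name, size in [
    ("teacher, 16-bit", teacher_16bit),
    ("teacher, 4-bit", teacher_4bit),
    ("student, 16-bit", student_16bit),
    ("student, 4-bit", student_4bit),
]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Distillation changes the first argument and quantization changes the second, which is why the savings multiply when you apply both.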

When it is worth it

The baseline to beat is always the same: take the small model you would have distilled into and use it directly on your task. Benchmark that first. Distillation earns its keep only when two things are true at once: the small model used directly is not good enough, and the teacher is too slow or too expensive to run at your scale. When both hold, distillation can give you most of the teacher's quality at the student's price, and that is a real win.

When they do not both hold, skip it. If a small off-the-shelf model already handles the task, you are doing extra work for nothing. If you can afford to run the teacher directly, run the teacher. Distillation is a specialization move for the specific case where a good small model does not exist yet for your task and a good large one is out of budget.
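The decision rule above is small enough to write down as code. This is a sketch under assumptions: the scores and costs are whatever metric and unit you already benchmark with, and the function name, parameters, and example numbers are hypothetical.

```python
def recommend(
    small_model_score: float,
    required_score: float,
    teacher_cost: float,
    budget: float,
) -> str:
    """Apply the two-condition test for whether distillation is worth doing."""
    # Condition 1 fails: the small model is already good enough.
    if small_model_score >= required_score:
        return "use the small model directly"
    # Condition 2 fails: the teacher is affordable at your scale.
    if teacher_cost <= budget:
        return "run the teacher directly"
    # Both hold: small model falls short and teacher is out of budget.
    return "distill"


# Hypothetical benchmark results and costs.
print(recommend(small_model_score=0.91, required_score=0.90, teacher_cost=500, budget=100))
print(recommend(small_model_score=0.78, required_score=0.90, teacher_cost=80, budget=100))
print(recommend(small_model_score=0.78, required_score=0.90, teacher_cost=500, budget=100))
```

The order of the checks mirrors the text: the direct small-model baseline is tested first because it is the cheapest outcome if it passes.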