Glossary

knowledge distillation

A training technique where a small student model learns to mimic a larger teacher model's output distributions, transferring capability into a cheaper-to-serve form.

Training. See also: Weights. Also known as: distillation.

A compression and capability-transfer technique. A small student model is trained on the soft output distributions of a larger teacher model, not just the hard labels in the original dataset. The richer supervision signal lets the student recover much of the teacher’s behavior with a fraction of the parameters and inference cost.
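As a rough illustration of training on soft distributions, the sketch below combines a temperature-softened KL term against the teacher with ordinary cross-entropy on the hard label. It is a minimal sketch and not a reference implementation: the function names, the `temperature` and `alpha` parameters, and the example logits are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical helper: weighted sum of a soft-target and a hard-label loss.

    alpha weights the soft (teacher) term; 1 - alpha weights the hard label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(p * math.log(p / q)
             for p, q in zip(p_teacher, p_student_soft) if p > 0)
    # Standard cross-entropy against the dataset's hard label.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[hard_label])
    # Scaling the soft term by temperature squared is a common convention
    # that keeps its gradient magnitude comparable across temperatures.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce

# Illustrative logits over three classes (made-up values).
teacher = [2.0, 1.0, 0.1]
matching_student = [2.0, 1.0, 0.1]
mismatched_student = [0.1, 1.0, 2.0]

loss_match = distillation_loss(matching_student, teacher, hard_label=0)
loss_mismatch = distillation_loss(mismatched_student, teacher, hard_label=0)
```

A student whose logits match the teacher's incurs no soft-target penalty, so `loss_match` is smaller than `loss_mismatch`. The soft term also tells the student how the teacher ranks the wrong classes, which a hard label alone cannot convey.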

Two patterns are common in 2026. Response distillation: the student trains on responses sampled from the teacher (often via API), which suits cases where only API access exists. Logit distillation: the student trains on the teacher’s full output distribution, which is possible only with weight-level access to the teacher.
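The difference between the two patterns can be sketched at the level of a single token position. In this toy example, with made-up probabilities and hypothetical function names, response distillation sees only the one token the teacher happened to sample, while logit distillation sees the whole distribution.

```python
import math

def response_loss(student_probs, sampled_token):
    """Response distillation: cross-entropy on one token sampled from the teacher."""
    return -math.log(student_probs[sampled_token])

def logit_loss(student_probs, teacher_probs):
    """Logit distillation: cross-entropy against the teacher's full distribution."""
    return -sum(p * math.log(q)
                for p, q in zip(teacher_probs, student_probs) if p > 0)

# Illustrative next-token distributions over a three-token vocabulary.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]

# Loss the student would see for each possible sampled token.
per_sample_losses = [response_loss(student_probs, tok)
                     for tok in range(len(teacher_probs))]

# Averaging those losses, weighted by how often the teacher samples each
# token, gives the same value as the full-distribution loss.
expected_response_loss = sum(p * loss
                             for p, loss in zip(teacher_probs, per_sample_losses))
full_distribution_loss = logit_loss(student_probs, teacher_probs)
```

Under these assumptions the two objectives agree in expectation, but any single sampled response is a noisy estimate of the target that logit distillation provides exactly at every position.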

Distillation underpins much of the small-open-model wave: many capable sub-3B open models trace back to a larger teacher’s outputs. The legal question of whether API-sourced distillation violates terms of service sits unresolved across most provider contracts in 2026.

