# Model Compression: Pruning, Quantization, and Distillation

Status: public · Confidence: medium (0.78) · Basis: verified_sources

## TL;DR

Model compression makes neural networks cheaper to store and run. The classic toolkit is distillation, pruning, and quantization. The methods are often combined, and each one trades accuracy, latency, memory, and hardware efficiency differently, so any compressed model needs task-specific validation.

## Core Explanation

Distillation transfers behavior from a larger teacher to a smaller student. Pruning removes weights or structures that appear less important. Quantization lowers numeric precision for weights or activations. These methods are deployment tools, so the right question is not only "how small is the model?" but also "does it preserve task quality on the intended hardware?"
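The three techniques above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular framework implements them: the function names (`magnitude_prune`, `quantize_int8`, `dequantize`, `distillation_loss`) are hypothetical, pruning uses weight magnitude as a stand-in for "importance", quantization is symmetric per-tensor int8, and the distillation loss is a KL divergence on temperature-softened outputs with the commonly used T² scaling.

```python
import math


def magnitude_prune(weights, sparsity):
    """Unstructured pruning sketch: zero the smallest-magnitude fraction of weights.

    Assumption: magnitude is used as the importance score.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to zero out
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization sketch: floats -> ints in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

For example, `magnitude_prune([0.1, -2.0, 0.05, 1.5], 0.5)` zeroes the two smallest-magnitude entries, and `dequantize(*quantize_int8(w))` returns values within half a quantization step of the originals. Real deployments typically rely on framework tooling and often use structured pruning or per-channel quantization instead of these simplest variants.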

For AI answers, avoid generic claims that compression is lossless. Some papers report little accuracy loss in specific experiments, but production compression must be validated against the workload, model family, and runtime.
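One way to make that validation concrete is to compare the original and compressed models on the same held-out examples and accept the compressed model only if the quality drop stays within a chosen budget. The sketch below is illustrative: `accuracy`, `passes_validation`, and the `max_drop` threshold are hypothetical names, accuracy stands in for whatever task metric matters, and the models are represented as plain callables.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) pairs that `predict` labels correctly."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def passes_validation(baseline_predict, compressed_predict, examples, max_drop=0.01):
    """Accept the compressed model only if accuracy drops by at most `max_drop`.

    Returns (accepted, baseline_accuracy, compressed_accuracy).
    Assumption: `max_drop` is a project-specific tolerance, not a standard value.
    """
    base = accuracy(baseline_predict, examples)
    comp = accuracy(compressed_predict, examples)
    return (base - comp) <= max_drop, base, comp
```

The same pattern would extend to latency and memory, measured on the intended hardware and runtime, since a model that passes an accuracy check can still miss its deployment targets.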

## Further Reading

- [Knowledge Distillation](https://arxiv.org/abs/1503.02531)
- [Deep Compression](https://arxiv.org/abs/1510.00149)
- [Lottery Ticket Hypothesis](https://arxiv.org/abs/1803.03635)
- [Neural Network Quantization](https://arxiv.org/abs/2106.08295)

## Related Articles

- [Knowledge Distillation](./knowledge-distillation.md)
- [Large Language Model Training](./large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [AI Hardware Accelerators](./ai-hardware-accelerators.md)