## TL;DR
Batch Normalization (BN) normalizes each layer's inputs to zero mean and unit variance within each mini-batch, then scales and shifts the result with learnable parameters. Benefits: faster training (it tolerates higher learning rates), less sensitivity to initialization, and a mild regularizing effect (which can reduce the need for dropout). BN is standard in most CNN architectures.
## Core Explanation
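During training, BN computes the mean μ and variance σ² of each feature over the mini-batch, normalizes with x̂ = (x − μ) / √(σ² + ε), where ε is a small constant for numerical stability, and then applies y = γ·x̂ + β, with γ and β learnable. At inference, it uses running averages of μ and σ² accumulated during training instead of batch statistics, so the output for a sample does not depend on the other samples in the batch. Internal Covariate Shift is the phenomenon BN was designed to address: the distribution of each layer's inputs changes during training as the parameters of earlier layers are updated. Layer Normalization (LN, used in Transformers) normalizes across the features of each individual sample rather than across the batch, so it does not depend on batch size and works for small or variable batches and for RNNs.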
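The training and inference paths described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the class name `BatchNorm1D`, the helper `layer_norm`, the defaults `eps=1e-5` and `momentum=0.1`, and the exponential-moving-average update rule are choices made for this example. It handles only 2-D inputs of shape `(batch, features)` and omits the backward pass.

```python
import numpy as np


class BatchNorm1D:
    """Forward-only batch norm for inputs of shape (batch, features)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        # Learnable scale and shift (initialized to the identity transform).
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        # Running statistics used at inference time.
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps = eps            # assumed default; avoids division by zero
        self.momentum = momentum  # assumed default; weight of the newest batch

    def forward(self, x, training):
        if training:
            # Per-feature statistics over the batch axis (biased variance).
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # Exponential moving average of the batch statistics.
            # Assumption: the biased variance is also used for the running
            # estimate; some frameworks use the unbiased one here.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: fixed statistics, independent of the current batch.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


def layer_norm(x, eps=1e-5):
    """Normalize each sample over its own features (no learnable params)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


# Usage: a batch of 64 samples with 4 features, far from zero mean/unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

bn = BatchNorm1D(num_features=4)
y_train = bn.forward(x, training=True)      # uses batch statistics
y_eval = bn.forward(x[:1], training=False)  # uses running statistics; batch of 1 is fine
z = layer_norm(x)                           # each row normalized independently
```

Note the axis difference: BN reduces over `axis=0` (the batch), so it needs more than one sample in training mode, while the layer-norm helper reduces over `axis=1` (the features) and gives the same result for a sample regardless of what else is in the batch.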
## Further Reading
- [Batch Normalization: Accelerating Deep Network Training (Ioffe & Szegedy, 2015)](https://arxiv.org/abs/1502.03167)