# Dropout and Regularization Techniques

## TL;DR
Regularization discourages neural networks from memorizing training data and pushes them toward patterns that generalize to unseen examples. The most widely used techniques are dropout, weight decay, data augmentation, and early stopping.

## Core Explanation
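Dropout can be viewed as an implicit ensemble. With n droppable units there are 2^n possible binary masks, each defining a "thinned" sub-network that shares its weights with all the others. Each training step samples one such mask by dropping every unit independently with probability p, so only a small fraction of the 2^n configurations is ever visited. At test time the full network is used, with activations scaled so that their expected value matches training, which approximates averaging the predictions of the sub-networks. Because no unit can count on any particular other unit being present, dropout discourages co-adaptation. Many modern architectures rely less on dropout than earlier ones did: batch normalization adds a mild regularizing effect of its own through mini-batch noise, and convolutional networks that use it often reduce or omit dropout. Layer normalization mainly stabilizes optimization rather than regularizing, and Transformer-style models typically still apply dropout alongside it.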
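The mask sampling and rescaling described above can be sketched in a few lines of NumPy. This is a minimal illustration of the common "inverted dropout" variant, which rescales during training instead of at test time. The function name `dropout_forward` and its parameters are hypothetical, chosen for this example and not taken from any particular library.

```python
import numpy as np


def dropout_forward(x, p_drop, rng, training=True):
    """Inverted dropout sketch (hypothetical helper, not a library API).

    During training, each activation is zeroed with probability p_drop and the
    survivors are divided by the keep probability, so the expected value of
    every activation is unchanged. At evaluation time the input passes through
    untouched.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return x
    keep_prob = 1.0 - p_drop
    # One Bernoulli(keep_prob) draw per activation: this is the sampled mask.
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((4, 8))

# Training: a fresh random mask, surviving entries scaled by 1 / keep_prob.
train_out = dropout_forward(activations, p_drop=0.5, rng=rng, training=True)

# Evaluation: no mask and no scaling.
eval_out = dropout_forward(activations, p_drop=0.5, rng=rng, training=False)
```

Scaling at training time keeps the evaluation path a plain forward pass, which is why this variant is the usual choice in practice.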

## Detailed Analysis
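Data augmentation (rotation, cropping, and color jitter for images; back-translation and word dropout for text) is often among the most effective regularizers, because it enlarges the effective training set with label-preserving variants of the data rather than constraining the model. Mixup (Zhang et al., 2018) goes a step further and creates virtual training examples as convex combinations of pairs of inputs and their labels: x' = λ·x_i + (1 − λ)·x_j and y' = λ·y_i + (1 − λ)·y_j, with λ in [0, 1] drawn from a Beta(α, α) distribution.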
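As a rough illustration of mixup, the sketch below mixes a batch with a shuffled copy of itself. It assumes one-hot labels and a single mixing coefficient per batch; the helper name `mixup_batch` and the value `alpha=0.2` are illustrative choices, not prescribed by the text above.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha, rng):
    """Mixup sketch (hypothetical helper): blend each example with a partner.

    Assumes y_onehot has shape (batch, num_classes) and that one lambda is
    shared by the whole batch.
    """
    # Mixing coefficient in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)
    # Pair every example with another one from the same batch.
    perm = rng.permutation(x.shape[0])
    # The same convex combination is applied to inputs and labels.
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))            # toy batch: 6 examples, 3 features
labels = np.array([0, 1, 2, 0, 1, 2])  # toy integer class labels
y_onehot = np.eye(3)[labels]           # one-hot encode the labels

x_mixed, y_mixed, lam = mixup_batch(x, y_onehot, alpha=0.2, rng=rng)
```

Because the labels are mixed with the same coefficient as the inputs, each row of `y_mixed` is still a valid probability distribution and can be used directly with a cross-entropy loss that accepts soft targets.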

## Further Reading
- Goodfellow et al., Deep Learning, Ch. 7
- CS231n: Regularization Notes
- Distill.pub: Visualizing Regularization