Optimization Algorithms for Deep Learning

## TL;DR
Optimization algorithms update neural network weights to minimize the loss function. Adam and its variants are the most common default in practice, but SGD with momentum remains competitive when its learning rate and schedule are well tuned.

## Core Explanation
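Full-batch gradient descent computes the gradient of the loss averaged over the entire training set before each weight update. Stochastic gradient descent (SGD) instead estimates that gradient from a single example or a mini-batch, giving cheaper, noisier updates that often generalize better. Momentum accumulates an exponentially decaying sum of past gradients, which smooths the updates and helps the optimizer move through plateaus, saddle points, and shallow local minima. Adam adapts the step size per parameter by scaling each update with running estimates of the gradient's first and second moments.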
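The momentum and Adam updates described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy quadratic loss, the function names (`sgd_momentum_step`, `adam_step`, `loss`, `grad_fn`), and the hyperparameter values are all assumptions chosen for the example.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.05, beta=0.9):
    """One SGD-with-momentum update on a list of scalar weights."""
    # Velocity is an exponentially decaying sum of past gradients.
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    w = [wi - lr * v for wi, v in zip(w, velocity)]
    return w, velocity

def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    # Running estimates of the first and second moments of the gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    new_w = []
    for wi, mi, vi in zip(w, m, v):
        m_hat = mi / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)  # bias-corrected second moment
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_w, m, v

# Hypothetical toy problem: a quadratic bowl that is 10x steeper along w[1].
def loss(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad_fn(w):
    return [w[0], 10.0 * w[1]]

# Run SGD with momentum for 200 steps from (1, 1).
w_mom, vel = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w_mom, vel = sgd_momentum_step(w_mom, grad_fn(w_mom), vel)

# Take a single Adam step from the same starting point.
start = [1.0, 1.0]
w_adam, m, v = adam_step(start, grad_fn(start), [0.0, 0.0], [0.0, 0.0], t=1)
```

On the first Adam step the bias-corrected moments reduce to the gradient and its square, so both coordinates move by roughly `lr` even though their gradients differ by a factor of ten. That is the per-parameter scaling in its simplest form.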

## Detailed Analysis
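AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient-based update. In Adam, weight decay implemented as an L2 penalty passes through the adaptive per-parameter scaling, so parameters with large gradient history are regularized less; AdamW instead applies the decay directly to the weights. It is generally the recommended variant when weight decay is used. Learning rate warmup, which gradually increases the rate from a small initial value over the first training steps, reduces early training instability in transformers. Gradient clipping caps the gradient norm (or individual gradient values) at a threshold, limiting the effect of exploding gradients.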
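The three techniques in this section can be combined in one training step. The sketch below is a plain-Python illustration under stated assumptions: the helper names (`warmup_lr`, `clip_by_global_norm`, `adamw_step`), the linear warmup shape, and all hyperparameter values are hypothetical choices for the example, not taken from any particular library.

```python
import math

def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp the rate up to base_lr, then hold it constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

def adamw_step(w, grad, m, v, t, lr, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    new_w = []
    for wi, mi, vi in zip(w, m, v):
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        # Decoupled decay: shrink the weight directly, outside the
        # adaptive scaling applied to the gradient term.
        wi = wi - lr * weight_decay * wi
        wi = wi - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_w.append(wi)
    return new_w, m, v

# Toy loop on the assumed loss f(w) = sum(w_i^2), whose gradient is 2 * w_i.
w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    grad = clip_by_global_norm([2.0 * wi for wi in w], max_norm=1.0)
    lr = warmup_lr(step, base_lr=0.01, warmup_steps=20)
    w, m, v = adamw_step(w, grad, m, v, t=step + 1, lr=lr)
```

Clipping runs before the optimizer step, so the moment estimates only ever see the clipped gradient. With a zero gradient, `adamw_step` still shrinks each weight by a factor of `1 - lr * weight_decay`, which shows that the decay acts independently of the adaptive term.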

## Further Reading
- Ruder: An Overview of Gradient Descent Optimization Algorithms
- Lilian Weng: Optimization for Deep Learning
- PyTorch: torch.optim Documentation