Optimization Algorithms for Deep Learning: SGD, Adam, AdamW, and Lion

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

Optimizers are the update rules that turn gradients into model-parameter changes. For agent builders, optimizer details matter when training or fine-tuning models, and they carry a general lesson: a performance claim cannot be interpreted without the learning-rate schedule, batch size, model size, and evaluation setup behind it.

## Core Explanation

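Stochastic gradient descent (SGD) updates parameters using gradients computed on mini-batches. Momentum smooths noisy updates by carrying part of the previous update forward. Adam adds adaptive per-parameter scaling based on exponential moving averages of gradients and squared gradients. AdamW decouples weight decay from the adaptive gradient update, applying it directly to the weights instead of folding it into the gradient as an L2 penalty, which is why it is common in Transformer training recipes. Lion keeps only a momentum buffer and updates each parameter by the sign of an interpolation between that momentum and the current gradient, so the step has the same magnitude for every parameter (before weight decay); it is typically run with a smaller learning rate than AdamW.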
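
The update rules for SGD with momentum, Adam, and AdamW can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, argument names, and default hyperparameters are assumptions chosen for the example, and real frameworks differ in details such as how momentum is parameterized.

```python
import numpy as np

def sgd_momentum_step(w, g, v, lr=0.1, mu=0.9):
    """One SGD step with momentum: carry a fraction `mu` of the previous update forward."""
    v = mu * v - lr * g
    return w + v, v

def adam_step(w, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam step at step count t (starting from 1).

    decoupled=False adds weight decay to the gradient (L2 penalty).
    decoupled=True applies it directly to the weights (AdamW-style).
    """
    if weight_decay and not decoupled:
        # L2 penalty: the decay term passes through the adaptive scaling below.
        g = g + weight_decay * w
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients
    s = beta2 * s + (1 - beta2) * g * g    # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
    s_hat = s / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    if weight_decay and decoupled:
        # Decoupled decay: shrink the weights independently of the gradient statistics.
        w_new = w_new - lr * weight_decay * w
    return w_new, m, s

# Hypothetical demo: zero gradient, so only weight decay moves the parameter.
w0 = np.array([2.0])
zero_grad = np.zeros(1)
w_adamw, _, _ = adam_step(w0, zero_grad, np.zeros(1), np.zeros(1), t=1,
                          lr=0.1, weight_decay=0.1, decoupled=True)
w_l2, _, _ = adam_step(w0, zero_grad, np.zeros(1), np.zeros(1), t=1,
                       lr=0.1, weight_decay=0.1, decoupled=False)
```

In the demo, the decoupled variant shrinks the weight by `lr * weight_decay * w` (to roughly 1.98), while the L2 variant moves it by about a full `lr` (to roughly 1.9) because the decay term is normalized by Adam's scaling. That difference is the practical meaning of "decoupled" weight decay.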

Learning-rate schedules can matter as much as the optimizer choice. Warmup can stabilize early training, cosine decay can reduce the learning rate smoothly, and restarts can periodically raise it again. A change in optimizer should be evaluated with its schedule, regularization, and batch-size assumptions, not as an isolated name.
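
As a rough sketch of the schedules mentioned above, the functions below compute a learning rate from the step index: linear warmup followed by cosine decay, and a cosine schedule with periodic restarts. The function names, parameter names, and default values are illustrative assumptions, not settings from any particular recipe.

```python
import math

def warmup_cosine_lr(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero rate.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def cosine_restarts_lr(step, base_lr=3e-4, cycle_steps=250, min_lr=0.0):
    """Cosine decay that jumps back to base_lr at the start of every cycle."""
    progress = (step % cycle_steps) / cycle_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample a few points of each schedule for inspection or logging.
schedule_preview = [(s, warmup_cosine_lr(s)) for s in (0, 50, 99, 100, 550, 1000)]
restart_preview = [(s, cosine_restarts_lr(s)) for s in (0, 125, 249, 250)]
```

Because the schedule is a pure function of the step index, it can be logged alongside the optimizer settings and replayed exactly when comparing runs.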

## Agent Notes

- When comparing training runs, log optimizer, learning rate, schedule, batch size, weight decay, gradient clipping, and random seed.
- For fine-tuning language models, start from the model provider's recommended optimizer settings before changing many variables at once.
- Treat optimizer changes as experiments with an eval plan, not as guaranteed improvements.
- If an agent proposes a training recipe, require reproducible configuration before accepting the result.
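
One way to make the notes above concrete is to record every run's training configuration in a single serializable object and diff it against the baseline. The sketch below is hypothetical: the `TrainRunConfig` class, its field names, and the example values are assumptions for illustration, not a standard format or recommended settings.

```python
import json
from dataclasses import dataclass, asdict, replace
from typing import Optional

@dataclass(frozen=True)
class TrainRunConfig:
    """The fields worth logging for every training run (illustrative set)."""
    optimizer: str
    learning_rate: float
    schedule: str
    batch_size: int
    weight_decay: float
    grad_clip_norm: Optional[float]
    seed: int

def config_to_json(cfg):
    """Serialize with sorted keys so logs from different runs are easy to compare."""
    return json.dumps(asdict(cfg), sort_keys=True)

def config_diff(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}

# Hypothetical values, for illustration only.
baseline = TrainRunConfig(
    optimizer="adamw", learning_rate=3e-4, schedule="warmup_cosine",
    batch_size=64, weight_decay=0.01, grad_clip_norm=1.0, seed=0,
)
candidate = replace(baseline, optimizer="lion", learning_rate=3e-5)

changed = config_diff(baseline, candidate)
if len(changed) > 1:
    print("Multiple variables changed at once:", sorted(changed))
```

A diff with more than one entry is a signal that the comparison mixes several changes, so any difference in results cannot be attributed to the optimizer alone.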

## Related Articles

- [Gradient Descent: The Workhorse of Machine Learning Optimization](../gradient-descent.md)
- [Transformer Architecture](../transformer.md)
- [Efficient and Green AI: Measuring Cost, Energy, and Deployment Tradeoffs](../efficient-green-ai.md)