# Distributed Training: FSDP, DeepSpeed, and Scaling Laws
Status: public
Confidence: medium (0.78) (verified)
Last verified: 2026-05-28
Generation: ai_structured


## TL;DR

Distributed training splits work across accelerators to train models that do not fit, or do not train efficiently, on a single device. This article limits its evidence to three published techniques: Megatron-LM, ZeRO, and GPipe.

## Core Explanation

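Each of the three cited papers represents one family of techniques. Megatron-LM covers tensor (intra-layer model) parallelism, which splits the weight matrices of individual layers across devices. ZeRO covers sharded memory optimization: it partitions optimizer states, gradients, and parameters across data-parallel workers instead of replicating them, the approach implemented in DeepSpeed and followed by PyTorch FSDP. GPipe covers pipeline parallelism, which places consecutive groups of layers on different devices and feeds them micro-batches to keep the stages busy. Together they show the main families of distributed deep-learning training techniques without relying on unverified proprietary details.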
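
The three families can be illustrated with a small single-process NumPy sketch. It is not code from any of the papers or libraries: the function names are hypothetical, ReLU stands in for the GeLU used in Megatron-LM, the `sum` stands in for an all-reduce, and the memory formula assumes the mixed-precision Adam accounting from the ZeRO paper (2 bytes per parameter for weights, 2 for gradients, and a multiplier `k = 12` for optimizer state). The bubble formula assumes every pipeline stage costs the same.

```python
import numpy as np


def tensor_parallel_mlp(x, a, b, n_shards):
    """Two-layer MLP with A split by columns and B split by rows."""
    a_shards = np.split(a, n_shards, axis=1)  # each "device" holds a column block of A
    b_shards = np.split(b, n_shards, axis=0)  # and the matching row block of B
    # ReLU is elementwise, so each device can apply it to its own slice.
    partials = [np.maximum(x @ a_i, 0.0) @ b_i
                for a_i, b_i in zip(a_shards, b_shards)]
    return sum(partials)  # stands in for the all-reduce across devices


def zero_bytes_per_device(n_params, n_devices, stage, k=12):
    """Model-state bytes per device for ZeRO stages 0 (none) to 3."""
    params, grads, opt = 2 * n_params, 2 * n_params, k * n_params
    if stage >= 1:
        opt /= n_devices     # stage 1 shards optimizer states
    if stage >= 2:
        grads /= n_devices   # stage 2 also shards gradients
    if stage >= 3:
        params /= n_devices  # stage 3 also shards parameters
    return params + grads + opt


def pipeline_bubble_fraction(n_stages, n_microbatches):
    """Idle fraction of a GPipe-style schedule with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
a = rng.standard_normal((8, 16))
b = rng.standard_normal((16, 8))

# The sharded computation should match the unsharded one.
reference = np.maximum(x @ a, 0.0) @ b
print(np.allclose(tensor_parallel_mlp(x, a, b, n_shards=4), reference))

# Per-device model-state memory in GB for 7.5B parameters on 64 devices.
for stage in range(4):
    print(stage, round(zero_bytes_per_device(7.5e9, 64, stage) / 1e9, 1))

# More micro-batches shrink the pipeline bubble.
print(pipeline_bubble_fraction(4, 4), pipeline_bubble_fraction(4, 16))
```

The column-then-row split is what lets the two matrix multiplications run without communication until the final sum. The memory loop uses the 7.5B-parameter, 64-device configuration from the ZeRO paper's illustration, where per-device model state drops from 120 GB to about 1.9 GB at stage 3. The last line shows why GPipe favors many micro-batches per mini-batch: with 4 stages, going from 4 to 16 micro-batches cuts the idle fraction from 3/7 to 3/19.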

## Further Reading

- [Megatron-LM](https://arxiv.org/abs/1909.08053)
- [ZeRO](https://arxiv.org/abs/1910.02054)
- [GPipe](https://arxiv.org/abs/1811.06965)

## Related Articles

- [Large Language Model Training](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [MLOps and LLMOps](../mlops-llmops.md)