# Distributed Training: FSDP, DeepSpeed, and Scaling Laws

Status: public
Confidence: medium (0.78) (verified)
Last verified: 2026-05-28
Generation: ai_structured

## TL;DR

Distributed training splits work across accelerators to train models that do not fit, or do not train efficiently, on a single device. The evidence here is limited to Megatron-LM, ZeRO, and GPipe; an earlier unsupported claim about GPT-4 training has been removed.

## Core Explanation

Each of the three sources represents one family of distributed deep-learning training techniques:

- Megatron-LM covers tensor (intra-layer model) parallelism, which splits the weight matrices of individual layers across devices.
- ZeRO covers sharded memory optimization, which partitions optimizer states, gradients, and parameters across data-parallel workers instead of replicating them. ZeRO is the sharding approach implemented in DeepSpeed, and PyTorch FSDP follows the same idea.
- GPipe covers pipeline parallelism, which splits the model into sequential stages on different devices and feeds them micro-batches.

Together they illustrate the main approaches without relying on unverified proprietary details.

## Further Reading

- [Megatron-LM](https://arxiv.org/abs/1909.08053)
- [ZeRO](https://arxiv.org/abs/1910.02054)
- [GPipe](https://arxiv.org/abs/1811.06965)

## Related Articles

- [Large Language Model Training](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [MLOps and LLMOps](../mlops-llmops.md)
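
To make the sharded memory optimization attributed to ZeRO above more concrete, here is a minimal sketch of the per-device memory arithmetic. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function name `model_state_bytes` and its parameters are hypothetical and are not part of DeepSpeed or any other library.

```python
# Per-parameter byte costs for mixed-precision Adam, following the
# accounting in the ZeRO paper (an assumption of this sketch).
PARAM_BYTES = 2   # fp16 weights
GRAD_BYTES = 2    # fp16 gradients
OPTIM_BYTES = 12  # fp32 master weights + momentum + variance


def model_state_bytes(num_params: float, num_devices: int, stage: int) -> float:
    """Approximate per-device model-state memory in bytes.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states sharded
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")

    params = num_params * PARAM_BYTES
    grads = num_params * GRAD_BYTES
    optim = num_params * OPTIM_BYTES

    # Each stage shards one more component across the data-parallel group.
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices

    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical setting: 7.5 billion parameters on 64 devices.
    for stage in range(4):
        gigabytes = model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gigabytes:.1f} GB per device")
```

Under these assumptions, a 7.5-billion-parameter model on 64 devices needs roughly 120 GB of model state per device when fully replicated, and about 31.4 GB, 16.6 GB, and 1.9 GB at stages 1, 2, and 3. Real memory use will be higher once activations and temporary buffers are included.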