# Mixture of Experts (MoE) Confidence: high Last verified: 2026-05-22 Generation: human_only ## TL;DR Mixture of Experts (MoE) is a neural network architecture that divides a model into multiple specialized "expert" sub-networks, with a learned gating mechanism routing each input token to only a subset of experts. This enables models with trillions of total parameters while keeping inference cost proportional to a much smaller "active" parameter count. First made practical for deep learning by Shazeer et al. (2017) at Google Brain, MoE is the dominant architecture for frontier-scale models as of 2026: GPT-4 reportedly uses MoE with approximately 1.76T total parameters but only ~280B active per forward pass. Open-source implementations (Mixtral, DeepSeek-V2) have brought MoE to the wider community. ## Core Explanation A traditional dense Transformer computes every parameter for every token. An MoE Transformer replaces some feed-forward layers with an MoE layer containing N expert FFNs plus a learned router (gate): For each token x: 1. **Gate**: g(x) = softmax(W_g · x) → probability distribution over N experts 2. **Top-k selection**: Select k experts with highest gate values (k=1 for Switch, k=2 for most MoE models) 3. **Sparse computation**: Route token x to selected experts; other experts are skipped entirely 4. **Combine**: y = Σᵢ g(x)ᵢ · Expertᵢ(x) → weighted sum of expert outputs This creates a **sparsely-activated** model: although the total parameter count is enormous, only a fraction is activated for any single token. Training and inference costs are proportional to the active parameter count, not the total. ### Load Balancing Without intervention, the router collapses to always selecting the same few experts ("rich getting richer"). Shazeer et al. (2017) introduced an **auxiliary load-balancing loss**: ``` L_balance = N · Σᵢ fᵢ · Pᵢ ``` where fᵢ = fraction of tokens routed to expert i, Pᵢ = average gate probability for expert i. This loss is minimized when all experts receive equal load (fᵢ = 1/N). It is added to the main training loss with a small coefficient (α ≈ 0.01). The Switch Transformer (Fedus et al., 2022) further introduced **expert capacity**: each expert has a fixed token capacity per batch; overflow tokens are "dropped" (passed through a residual connection instead). This ensures predictable memory usage and computational load. Token dropping also serves as a regularizer — dropped tokens still see their FNN input through the residual connection. ## Detailed Analysis ### MoE Scaling Advantage | Configuration | Total Params | Active Params | Token/Batch | Expert Computation | |--------------|:-----------:|:------------:|:-----------:|:------------------:| | Dense | 175B | 175B | All | 100% of FFN | | MoE, N=8, k=2 | 175B × 8 = 1.4T | ~350B | Each → 2 experts | ~25% of FFN | With 8 experts and top-2 gating, total parameters grow ~4× compared to a dense model with the same active parameters. The cost per token is approximately the dense cost to compute the gate + k expert FFN computations, while the model has access to k·N× the parameter capacity. ### Known MoE Deployments | Model | Total Params | Active Params | N Experts | k | Year | |-------|:-----------:|:------------:|:---------:|:---:|:----:| | Sparsely-Gated MoE | 137B | ~34B | 2048 | 2-4 | 2017 | | GShard | 600B | ~200B | 2048 | 2 | 2020 | | Switch Transformer | 1.6T | ~25B | 2048 | 1 | 2021 | | GLaM | 1.2T | ~96B | 64 | 2 | 2021 | | GPT-4 (reported) | ~1.76T | ~280B | 8 | 2 | 2023 | | Mixtral 8×7B | 46.7B | 12.9B | 8 | 2 | 2024 | | DeepSeek-V2 | 236B | 21B | 160 | 6 | 2024 | | Gemini 1.5 | Trillions | ~200B+ | Multi | MoE | 2024 | ### Router Design Choices | Approach | Description | Trade-off | |----------|------------|----------| | **Top-k Gating** | Select k experts with highest gate scores | Standard; k=2 common | | **Expert Choice** (Zhou et al., 2022) | Experts choose tokens (not vice versa) | Perfect load balance; harder to implement | | **Soft MoE** (Puigcerver et al., 2023) | Weighted combinations of all experts (no hard routing) | Simpler training; more computation | | **DeepSeekMoE** (Dai et al., 2024) | Shared experts + routed experts | Efficiency at large scale | ### Challenges | Challenge | Description | Mitigation | |-----------|------------|-----------| | **Load imbalance** | Some experts receive disproportionately more tokens | Load-balancing loss, expert capacity, expert choice routing | | **Memory overhead** | All experts must be stored in GPU memory | Expert parallelism (distribute across devices), quantization | | **Communication cost** | All-to-all collectives when experts span GPUs | Token buffering, overlapping compute and communication | | **Fine-tuning instability** | MoE harder to fine-tune than dense models | Gradual unfreezing, QLoRA, expert dropout | | **Expert collapse** | Some experts learn nothing useful | Reinitialize inactive experts, dynamic expert count | ## Further Reading - [Shazeer et al. (2017)](https://arxiv.org/abs/1701.06538): Original sparsely-gated MoE paper - [Switch Transformers](https://arxiv.org/abs/2101.03961): Scaling to trillion parameters (Fedus et al., 2022) - [Mixtral Paper](https://arxiv.org/abs/2401.04088): Open-source MoE by Mistral AI