# Mixture of Experts (MoE)
Status: public
Confidence: medium (0.855) (verified)
Last verified: 2026-05-30
Generation: ai_structured


## TL;DR

Mixture of Experts, or MoE, is a sparse neural-network architecture pattern in which a router sends each token or example to a subset of expert networks. The point is to increase total model capacity without evaluating every parameter for every token.

## Core Claims

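Sparsely-gated MoE layers use a learned router so that only the selected experts process a given input. This decouples total parameter count from per-token computation: adding experts grows model capacity, while the compute spent on each token depends on how many experts the router selects.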
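
The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from any of the cited papers: the names `route_logits`, `top_k_route`, and `moe_forward` are hypothetical, the router is assumed to be a single linear map, and renormalizing with a softmax over only the selected logits is one common choice among several.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_logits(x, router_weights):
    # Assumed router: one linear map with one weight row per expert.
    return [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]

def top_k_route(logits, k):
    # Indices of the k largest logits, highest first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only.
    weights = softmax([logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, router_weights, experts, k=2):
    # Weighted sum of the selected experts' outputs.
    # Experts that are not selected are never called.
    out = [0.0] * len(x)
    for idx, weight in top_k_route(route_logits(x, router_weights), k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Calling `moe_forward` with `k=1` gives single-expert routing, while larger `k` trades more compute per token for a blend of several experts.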

Switch Transformers simplify routing by selecting a single expert per token (top-1 routing). GLaM and Mixtral apply MoE layers to large language models and route each token to two experts per MoE layer, so only a fraction of the total parameters is active for any given token.
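
The gap between total and active parameters is simple arithmetic. The sketch below uses a hypothetical helper, `moe_param_counts`, and made-up layer sizes chosen only to make the numbers easy to follow; they do not describe any published model. It also ignores the router's own parameters, which are typically small by comparison.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token, num_moe_layers):
    # Total: every expert in every MoE layer is stored.
    total = shared_params + num_moe_layers * num_experts * params_per_expert
    # Active: only the experts chosen for a token are evaluated.
    active = shared_params + num_moe_layers * experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration: 4 MoE layers, 8 experts each.
config = dict(shared_params=1_000_000, params_per_expert=500_000,
              num_experts=8, num_moe_layers=4)

total, active_top1 = moe_param_counts(experts_per_token=1, **config)
_, active_top2 = moe_param_counts(experts_per_token=2, **config)

print(total)        # 17000000
print(active_top1)  # 3000000
print(active_top2)  # 5000000
```

In this toy configuration, moving from top-1 to top-2 routing raises active parameters per token from 3M to 5M while the stored total stays at 17M.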

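The main engineering challenge is routing quality. Practical MoE systems must balance load across experts, cap how many tokens each expert accepts per batch (expert capacity), limit the communication overhead of moving tokens between the devices that host different experts, and encourage experts to specialize rather than become redundant or unused.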
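
Two of these concerns, expert capacity and load balancing, can be illustrated with a small top-1 example. The sketch follows the general form described in the Switch Transformers paper: capacity is the average number of tokens per expert scaled by a capacity factor, and the auxiliary loss is a coefficient times the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). The function names, the default values, rounding capacity up, and dropping overflow tokens in arrival order are assumptions made for illustration.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # Maximum tokens one expert accepts per batch (rounding up is an assumption).
    return math.ceil(capacity_factor * num_tokens / num_experts)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def dispatch_top1(router_probs, capacity):
    # Send each token to its highest-probability expert.
    # Tokens arriving after an expert is full are marked None (dropped).
    num_experts = len(router_probs[0])
    load = [0] * num_experts
    assignments = []
    for probs in router_probs:
        e = argmax(probs)
        if load[e] < capacity:
            load[e] += 1
            assignments.append(e)
        else:
            assignments.append(None)
    return assignments, load

def load_balance_loss(router_probs, alpha=0.01):
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    for probs in router_probs:
        counts[argmax(probs)] += 1
    # f[i]: fraction of tokens whose top-1 expert is i.
    f = [c / num_tokens for c in counts]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, two experts: one balanced router and one collapsed router.
balanced = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
skewed = [[0.9, 0.1]] * 4
```

With these inputs the collapsed router yields a larger auxiliary loss than the balanced one, and with a capacity of 2 it drops half of its tokens because every token targets the same expert.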

## Citation Boundaries

Use this article for stable MoE architecture concepts. Do not use it for unverified claims about private model parameter counts, current frontier-model internals, or product-specific serving costs.

## Further Reading

- [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905)
- [Mixtral of Experts](https://arxiv.org/abs/2401.04088)