Language Modeling Theory: Prediction, Perplexity, and Scaling Laws

Status: public · Confidence: medium (0.83) · Basis: verified_sources

## TL;DR

Language modeling is the task of assigning probabilities to token sequences, usually by predicting the next token from prior context. Its core measurements come from probability and information theory, while modern scaling laws describe empirical trends seen in large neural models.

## Core Explanation

A language model estimates how likely a token is given the tokens before it. Classic n-gram models estimate these probabilities from counts over short, fixed-length context windows; neural models learn distributed representations and can generalize across similar contexts. Transformer language models use self-attention to condition on longer contexts, but the objective is unchanged: predict a probability distribution over the next token.
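To make the count-based idea concrete, here is a minimal sketch of a bigram model (an n-gram model with a one-token context), assuming whitespace tokenization and no smoothing. The function names `train_bigram` and `next_token_probs` and the toy corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each one-token context."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev) from raw counts."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        # Unseen context: this unsmoothed sketch has no estimate to offer.
        return {}
    return {tok: c / total for tok, c in following.items()}


# Toy corpus (an assumption for illustration, not real training data).
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" follows "the" twice and "mat" once
```

The empty result for an unseen context shows why pure counting generalizes poorly: a context that never appeared in training gets no estimate at all, which is the gap that smoothing and, later, learned representations try to close.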

Perplexity is a compact way to report average predictive uncertainty. It is the exponential of the average per-token cross-entropy (equivalently, the average negative log-likelihood), so a lower value means the model assigned more probability to the observed text. Scaling-law studies then ask how that loss changes as model size, data size, and compute increase.
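The link between perplexity and cross-entropy can be sketched in a few lines. This is a minimal illustration, assuming the probabilities a model assigned to each observed token are already available as a list and that natural logarithms are used; the function names and the example probabilities are hypothetical.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-likelihood, in nats per token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponential of the average per-token cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Made-up probabilities assigned to the observed tokens (illustrative only).
uniform = [0.25] * 4                  # a model guessing uniformly among 4 options
confident = [0.9, 0.8, 0.95, 0.7]     # a model that mostly favors the observed token

print(perplexity(uniform))    # approximately 4: as uncertain as a 4-way uniform guess
print(perplexity(confident))  # lower, because more probability went to the observed text
```

Reading perplexity as an "effective number of equally likely choices" is what makes it easier to interpret than raw loss, although the two carry the same information.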

## Related Articles

- [Large Language Model Training: Scaling Laws, Data Curation, and Compute](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [Large Language Models (LLMs)](../llms.md)
- [Attention Mechanism: Query-Key-Value and Contextual Representation](../attention-mechanism.md)