Language Modeling: From N-grams to Scaling Laws and Information-Theoretic Foundations

## TL;DR
Language modeling, the deceptively simple task of predicting the next token, is the foundation of GPT-4 and other modern LLMs. From Claude Shannon's 1948 information theory to the scaling laws that govern billion-parameter models, the mathematics of prediction ties together count-based statistical models, neural networks, and the capabilities that emerge from large-scale pretraining.

## Core Explanation
The language modeling objective: given context tokens x_1, x_2, ..., x_{t-1}, predict the distribution P(x_t | x_{<t}) over the next token. The per-token loss is the negative log-likelihood, -log P(x_t | x_{<t}). Perplexity is the exponential of the average per-token loss, so lower perplexity means the model is less surprised by the text.

Evolution: (1) N-gram models (pre-2013) -- count-based, sparse models; (2) Neural LMs (2013-2017) -- RNNs that learn continuous word representations; (3) Transformer LMs (2017-present) -- GPT, BERT (masked LM), T5. Self-attention captures long-range dependencies and scales to trillion-token training.
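The loss and perplexity definitions above can be made concrete with a toy count-based model. The following is a minimal sketch, not a reference implementation: it assumes a bigram model (context truncated to the single previous token) with add-one smoothing, and the names `train_bigram`, `average_nll`, and `perplexity` are illustrative choices rather than anything from a particular library.

```python
import math
from collections import Counter


def train_bigram(tokens, vocab):
    """Return P(next | prev) from bigram counts with add-one smoothing.

    Assumption: context is only the previous token, and every token seen
    at evaluation time belongs to `vocab`.
    """
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    vocab_size = len(vocab)

    def prob(prev, nxt):
        # Add-one smoothing keeps unseen bigrams from getting probability 0.
        return (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def average_nll(prob, tokens):
    """Mean of -log P(x_t | x_{t-1}) over the sequence, in nats."""
    losses = [-math.log(prob(prev, nxt)) for prev, nxt in zip(tokens, tokens[1:])]
    return sum(losses) / len(losses)


def perplexity(prob, tokens):
    """Perplexity is the exponential of the average per-token loss."""
    return math.exp(average_nll(prob, tokens))


# Toy data, invented for illustration only.
corpus = "the cat sat on the mat the cat ate".split()
model = train_bigram(corpus, set(corpus))
print(perplexity(model, "the cat sat".split()))
```

A model that spreads probability uniformly over V tokens has perplexity exactly V, which is why perplexity is often read as an effective branching factor.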

## Detailed Analysis
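Scaling laws (Kaplan et al., 2020): the authors varied model size (768 to 1.5B non-embedding parameters), dataset size, and compute. Key findings: (1) loss L(N) follows a power law in model size N; (2) loss follows a similar power law in dataset size; (3) larger models are more sample-efficient, reaching the same loss with fewer training tokens.

Chinchilla (Hoffmann et al., 2022) corrected the compute-optimal recipe: model size and training tokens should grow together, which works out to roughly 20 training tokens per parameter. Chinchilla-70B, trained on about 1.4T tokens, matches or outperforms Gopher-280B (4x larger) at a comparable compute budget.

Information-theoretic view: language modeling is compression. A model that assigns probability P to a text can, via arithmetic coding, encode it in about -log2 P bits, so lower loss means a shorter code. A perfect LM would reach the entropy of English, commonly estimated at ~1.0 bits/character. Reported state-of-the-art results are around 0.8-1.0 bits/character. In the other direction, gzip + kNN was reported to outperform BERT on some text classification datasets, particularly low-resource ones (Jiang et al., ACL 2023), suggesting that even simple compressors capture useful linguistic structure.

Key open question: why does next-token prediction lead to chain-of-thought reasoning and in-context learning? The theoretical understanding of emergent abilities remains incomplete.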
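The gzip + kNN idea mentioned above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes the normalized compression distance (NCD) as the similarity measure, joins the two texts with a single space before compressing, and uses a plain majority vote. The function names and the toy training data are hypothetical.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance; smaller means more shared structure."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(query: str, train: list[tuple[str, str]], k: int = 1) -> str:
    """Label `query` by majority vote among its k nearest training texts."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # Ties go to the label that appears first, i.e. the nearest neighbour.
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration only.
train_set = [
    ("the striker scored a late goal in the football match", "sports"),
    ("the central bank raised interest rates again this quarter", "finance"),
]
print(knn_classify("the goalkeeper saved a penalty in the football match", train_set))
```

The intuition is that if two texts share vocabulary and phrasing, compressing them together costs little more than compressing one alone, so the distance is small. On very short texts the fixed gzip header overhead makes distances noisy, so a toy example like this one shows the mechanism rather than realistic accuracy.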