Transformer Architecture

## TL;DR

The Transformer is a neural network architecture based solely on attention mechanisms, introduced by Vaswani et al. from Google Brain in the 2017 NeurIPS paper "Attention Is All You Need." It replaced recurrence and convolution with multi-head self-attention, enabling fully parallel computation across input sequences. The base model achieved 28.4 BLEU on WMT 2014 English-to-German translation, setting a new state-of-the-art while requiring only 3.5 days of training on 8 NVIDIA P100 GPUs — a fraction of the training time of previous approaches. As of May 2026, the paper has been cited over 140,000 times and is the foundation of virtually all modern language models (BERT, GPT, LLaMA, Claude, Gemini).

## Core Explanation

The Transformer processes sequences through stacked encoder and decoder layers. Unlike RNNs (LSTM, GRU), which process tokens sequentially (O(n) sequential operations), the Transformer computes relationships between all positions simultaneously through self-attention, achieving constant parallel time complexity per layer.

### Self-Attention Mechanism

Scaled Dot-Product Attention is the core operation:

```
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
```

- **Q (Query)** : What this position is looking for
- **K (Key)** : What this position offers as a match
- **V (Value)** : The actual information to aggregate
- **√dₖ divider**: Prevents softmax saturation (dₖ = key dimension, typically 64)

Multi-head attention runs this computation in parallel across h=8 different learned projections (subspaces), enabling the model to attend to different types of relationships simultaneously (syntactic, semantic, positional, etc.).

## Detailed Analysis

### Architecture Specifications

| Component | Base Model | Big Model |
|-----------|:----------:|:---------:|
| Encoder layers | 6 | 6 |
| Decoder layers | 6 | 6 |
| Model dimension (d_model) | 512 | 1024 |
| Attention heads (h) | 8 | 16 |
| Feed-forward dimension (d_ff) | 2048 | 4096 |
| Total parameters | 65M | 213M |
| Training time (WMT EN-DE) | 3.5 days | 3.5 days |
| Hardware | 8 × P100 | 8 × P100 |

### Key Architectural Components

**Positional Encoding**: Since the Transformer has no recurrence and processes all positions simultaneously, it requires explicit position information. The original paper uses fixed sinusoidal encodings:

```
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

**Encoder**: 6 identical layers, each consisting of:
1. Multi-head self-attention (all positions attend to all positions)
2. Residual connection + layer normalization
3. Position-wise feed-forward network (FFN: two linear transformations with ReLU: max(0, xW₁ + b₁)W₂ + b₂)
4. Residual connection + layer normalization

**Decoder**: 6 identical layers, each with three sub-layers:
1. Masked multi-head self-attention (prevents attending to future positions during autoregressive generation)
2. Cross-attention to encoder output (Q from decoder, K, V from encoder)
3. Position-wise feed-forward network with residual connections

### Performance and Impact

| Metric | Value | Verification |
|--------|-------|-------------|
| WMT 2014 EN-DE BLEU | 28.4 (base), 28.5 (big at ensemble size 8) | Original paper Table 2 |
| WMT 2014 EN-FR BLEU | 41.0 (big, single model) | Original paper Table 2 |
| Training speed vs. RNN | 10-100× faster wall-clock time | Original paper: 3.5 days vs. weeks |
| Google Scholar citations (May 2026) | 140,000+ | Verified via multiple citation databases |
| HuggingFace ecosystem (Spring 2026) | 2M+ public models total | HuggingFace State of Open Source Report |

### Why Transformer Surpassed RNNs

RNNs (including LSTMs and GRUs) process tokens sequentially: the computation at position t cannot begin until position t-1 is complete. This creates a fundamental bottleneck:
1. **Training inefficiency**: Cannot parallelize across sequence positions
2. **Gradient problems**: Long sequences suffer vanishing/exploding gradients
3. **Memory limitations**: Must maintain hidden state for entire sequence

The Transformer eliminated all three by computing attention between every position pair simultaneously. The only sequential dependency is the number of layers (typically 6-96), not the sequence length.

### Derivative Architecture Family

The Transformer spawned four major architecture families:

| Family | Architecture | Example Models |
|--------|-------------|---------------|
| **Encoder-only** | Bidirectional self-attention | BERT, RoBERTa, DeBERTa |
| **Decoder-only** | Autoregressive masked self-attention | GPT-1/2/3/4, LLaMA, Claude |
| **Encoder-Decoder** | Full Transformer | T5, BART, original NMT |
| **Vision** | Patch-based self-attention | ViT, DINO, DeiT |

## Further Reading

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762): Original paper (140K+ citations)
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/): Visual walkthrough by Jay Alammar
- [CS224n: Transformers Lecture](https://web.stanford.edu/class/cs224n/): Stanford NLP course