Attention vs. Self-Attention

Status: public · Confidence: medium (0.8) · Basis: verified_sources

## TL;DR

Attention lets one set of states weight and combine information from another set of states. Self-attention is the special case in which both sets come from the same sequence.

## Core Explanation

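The key distinction is where the two sides of the attention computation come from. In early neural machine translation, attention relates a decoder state to the encoder states of the source sentence. In self-attention, it relates each position of a sequence to the other positions of that same sequence.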
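To make the distinction concrete, here is a minimal sketch in plain Python. It assumes scaled dot-product scoring, as in the Transformer; Bahdanau attention scores with a small feed-forward network instead. The learned query, key, and value projections are omitted, and the names `attention`, `random_states`, `encoder_states`, and `decoder_states` are illustrative, not taken from any library. The same function gives either behavior depending only on where its inputs come from.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights), where weights[i][j] is how much
    query i attends to key j. Learned projections are omitted.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: dot product scaled by sqrt of the key dimension.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w_j * v[i] for w_j, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def random_states(n_positions, dim, rng):
    """Hypothetical stand-in for hidden states produced by a real model."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_positions)]

rng = random.Random(0)
encoder_states = random_states(5, 8, rng)  # 5 source positions
decoder_states = random_states(3, 8, rng)  # 3 target positions

# Decoder-to-encoder attention: queries from one sequence, keys/values from another.
cross_out, cross_w = attention(decoder_states, encoder_states, encoder_states)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attention(encoder_states, encoder_states, encoder_states)

print(len(cross_w), len(cross_w[0]))  # 3 5
print(len(self_w), len(self_w[0]))    # 5 5
```

The weight matrix has one row per query and one column per key, so decoder-to-encoder attention yields a 3 x 5 matrix here, while self-attention over the 5 encoder positions yields a square 5 x 5 matrix.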

## Source-Mapped Facts

- Bahdanau attention lets a neural machine translation model jointly learn to align and translate: while generating each target word, the model searches for the relevant parts of the source sentence. ([source](https://arxiv.org/abs/1409.0473))
- The Transformer replaces recurrence and convolution with attention mechanisms, including self-attention over sequence positions. ([source](https://arxiv.org/abs/1706.03762))
- *Efficient Transformers: A Survey* reviews Transformer variants designed to reduce the computational and memory cost of attention. ([source](https://doi.org/10.1145/3530811))
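As a rough illustration of the efficiency concern in the last item, the sketch below counts the attention scores a straightforward implementation would hold at once. It assumes the full score matrix is materialized, with 4-byte (float32) scores, for a single head in a single layer. The function names are hypothetical, and the figures are plain arithmetic, not measurements of any model.

```python
def score_matrix_entries(n_queries, n_keys):
    """Number of attention scores: one per (query, key) pair."""
    return n_queries * n_keys

def self_attention_score_bytes(seq_len, bytes_per_score=4):
    """Bytes needed to hold the full self-attention score matrix.

    Assumes every position attends to every position and that the
    whole matrix is stored at once (float32 by default).
    """
    return score_matrix_entries(seq_len, seq_len) * bytes_per_score

for seq_len in (512, 1024, 2048):
    mib = self_attention_score_bytes(seq_len) / 2**20
    print(seq_len, mib)
# 512 1.0
# 1024 4.0
# 2048 16.0
```

Doubling the sequence length quadruples the score matrix, which is the quadratic growth that motivates more efficient variants.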

## Further Reading

- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Efficient Transformers: A Survey](https://doi.org/10.1145/3530811)