# Attention Mechanisms: Scaled Dot-Product to FlashAttention

Status: public
Confidence: medium (0.8) (verified)
Last verified: 2026-05-24
Generation: ai_structured

## TL;DR

Attention mechanisms let neural networks dynamically focus on the most relevant parts of an input sequence. Since the Transformer (Vaswani et al., 2017), attention-based architectures have become the dominant paradigm in NLP, vision, and multimodal AI.

## Core Explanation

Self-attention computes a weighted representation for each token: every token attends to all others, with weights given by a softmax over scaled pairwise query-key dot products. Multi-head attention runs several attention operations in parallel, each on its own learned projection, so that different heads can capture different relationship types. Because every token is compared with every other token, time and memory grow quadratically, O(n²), in the sequence length n, which drives ongoing efficiency research.

## Detailed Analysis

- FlashAttention computes exact attention but speeds it up by minimizing reads and writes to GPU high-bandwidth memory (HBM). It processes the inputs in tiles that fit in on-chip memory and recomputes intermediate values in the backward pass instead of storing the full n × n attention matrix.
- Sparse attention (Sparse Transformers, BigBird) restricts each token to a subset of positions, using patterns such as sliding windows or random attention.
- Linear attention (Linformer, Performer) approximates full attention to reach linear complexity in n, trading some exactness for speed.

## Further Reading

- The Illustrated Transformer (Jay Alammar)
- FlashAttention GitHub
- Lilian Weng: Attention? Attention!

## Related Articles

- [Attention Mechanism](../attention-mechanism.md)
- [Attention vs. Self-Attention](../attention-vs-self-attention.md)
- [LLM Inference Optimization: From FlashAttention to Speculative Decoding](../llm-inference-optimization.md)
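
## Worked Example

To make the ideas from Core Explanation and Detailed Analysis concrete, here is a minimal pure-Python sketch. It is not the FlashAttention kernel or any library's API. The function names `attention` and `tiled_attention` and the `block` parameter are hypothetical, chosen for illustration. The first function computes scaled dot-product attention, softmax(QKᵀ / √d) V, one query at a time. The second visits keys and values in blocks and keeps a running softmax, which is the numerical trick that lets tiled approaches avoid holding a full row of scores at once. It assumes a single head, no masking, and no batching.

```python
import math


def attention(Q, K, V):
    """Reference scaled dot-product attention (single head, no mask)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        # Pairwise similarity of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Numerically stable softmax over the full row of scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but keys/values are consumed block by block.

    Illustrative only: shows the running (online) softmax idea,
    not the memory-hierarchy behaviour of a real GPU kernel.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = float("-inf")   # largest score seen so far
        denom = 0.0                   # running softmax denominator
        acc = [0.0] * len(V[0])       # running unnormalized output
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial sums to the new maximum.
            correction = math.exp(running_max - new_max)
            denom *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                p = math.exp(s - new_max)
                denom += p
                acc = [a + p * vj for a, vj in zip(acc, v)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(attention(Q, K, V))
    print(tiled_attention(Q, K, V, block=2))
```

Both functions should agree up to floating-point rounding, because the tiled version only reorders the arithmetic. The reference version still does O(n²) work, and so does the tiled one; the difference a real tiled kernel exploits is where the intermediate scores live, not how many there are.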