LLM Inference Optimization: From FlashAttention to Speculative Decoding
Status: public · Confidence: medium (0.82) · Basis: verified_sources
## TL;DR

LLM inference optimization reduces memory use, latency, and serving cost for transformer models. This repaired entry focuses on three well-cited techniques: PagedAttention, FlashAttention, and LLM.int8().

## Core Explanation

PagedAttention targets key-value cache fragmentation, FlashAttention targets attention memory traffic, and LLM.int8() targets matrix multiplication precision. The prior future tutorial source was removed from public evidence.

## Further Reading

- [PagedAttention / vLLM](https://arxiv.org/abs/2309.06180)
- [FlashAttention](https://arxiv.org/abs/2205.14135)
- [LLM.int8()](https://arxiv.org/abs/2208.07339)

## Related Articles

- [Large Language Model Training](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [Efficient Green AI](../efficient-green-ai.md)
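
## Illustrative Sketch: Block-Wise Softmax

The Core Explanation says FlashAttention targets attention memory traffic. One ingredient of that approach is computing softmax attention block by block, so the full score vector never has to be materialized at once. The sketch below illustrates only that numerical idea for a single query vector in plain Python. It is not the FlashAttention GPU kernel, and the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, chosen for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference result: softmax(q . k / sqrt(d)) weighted sum of values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(dim_v)
    ]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are consumed one block at a time.

    Only a running max, a running denominator, and an output accumulator
    are kept between blocks (the "online softmax" recurrence).
    """
    d = len(q)
    dim_v = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim_v

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_block
        ]

        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim_v):
                acc[j] += w * v[j]

        running_max = new_max

    return [a / running_sum for a in acc]
```

Calling `tiled_attention(q, keys, values, block_size=b)` should agree with `naive_attention(q, keys, values)` up to floating-point rounding for any positive `b`. The block size affects only how much intermediate state is live at once, not the result.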