# LLM Inference Optimization: From FlashAttention to Speculative Decoding

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

LLM inference optimization reduces memory use, latency, and serving cost for transformer models. This entry focuses on three well-cited techniques: PagedAttention, FlashAttention, and LLM.int8().

## Core Explanation

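PagedAttention targets key-value (KV) cache fragmentation by storing the cache in fixed-size blocks that need not be contiguous in memory. FlashAttention targets attention memory traffic by computing exact attention in tiles, which avoids writing the full attention matrix to GPU high-bandwidth memory. LLM.int8() targets the memory cost of matrix multiplication by running most of it in 8-bit integers while keeping outlier feature dimensions in 16-bit precision. The tutorial source cited previously was removed from public evidence.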
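
To make the three mechanisms more concrete, here is a minimal pure-Python sketch of the core idea behind each one. It is not code from any of the cited papers or their libraries. The names `PagedKVCache`, `tiled_attention`, and `int8_vecmat`, as well as the block size, tile size, and outlier threshold, are assumptions chosen for illustration only.

```python
import math

# --- PagedAttention-style KV cache bookkeeping (toy block table) ---
class PagedKVCache:
    """Maps each sequence's logical token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# --- FlashAttention-style tiling: online softmax for a single query ---
def tiled_attention(q, keys, values, tile=2):
    """Exact softmax attention computed tile by tile, never storing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    m, l = -math.inf, 0.0         # running max and running softmax denominator
    acc = [0.0] * len(values[0])  # running unnormalized output
    for start in range(0, len(keys), tile):
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescales the earlier partial sums
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


# --- LLM.int8()-style product: int8 for regular features, float for outliers ---
def int8_vecmat(x, W, threshold=6.0):
    """Approximates x @ W, where x has length k and W has k rows and n columns."""
    outliers = [i for i, xi in enumerate(x) if abs(xi) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    n = len(W[0])
    # Outlier feature dimensions stay in floating point.
    out = [sum(x[i] * W[i][j] for i in outliers) for j in range(n)]
    if not regular:
        return out
    sx = 127.0 / (max(abs(x[i]) for i in regular) or 1.0)  # absmax scale for x
    xq = [round(x[i] * sx) for i in regular]
    for j in range(n):
        sw = 127.0 / (max(abs(W[i][j]) for i in regular) or 1.0)  # per-column scale
        wq = [round(W[i][j] * sw) for i in regular]
        int_dot = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
        out[j] += int_dot / (sx * sw)                 # dequantize and add
    return out
```

The sketch only shows the bookkeeping and arithmetic. Production systems implement these ideas as GPU kernels, and the performance benefits come from that hardware-level execution rather than from the logic alone. `tiled_attention` returns the same result as ordinary softmax attention up to floating-point rounding, while `int8_vecmat` returns an approximation whose error depends on the quantization scales.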

## Further Reading

- [PagedAttention / vLLM](https://arxiv.org/abs/2309.06180)
- [FlashAttention](https://arxiv.org/abs/2205.14135)
- [LLM.int8()](https://arxiv.org/abs/2208.07339)

## Related Articles

- [Large Language Model Training](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
- [Efficient Green AI](../efficient-green-ai.md)