Glossary

FlashAttention

An exact attention algorithm that reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving a 2 to 4 times speedup with no loss in output quality.

Runtime (also: Silicon). Aka: flash attention

A reimplementation of attention that exploits the speed difference between GPU SRAM (on-chip, fast, small) and HBM (off-chip, slow, large). The standard attention algorithm materializes a sequence_length by sequence_length attention matrix in HBM, then reads it back to compute the output; FlashAttention tiles the computation so that each block of the matrix is computed and consumed in SRAM and the full matrix is never written to HBM, eliminating most of the slow memory traffic.
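The tiling idea can be sketched in plain Python. This is a minimal illustration, not the actual GPU kernel: it assumes non-causal, single-head attention on nested lists, and the function names and the block parameter are hypothetical. The reference version builds the whole score matrix; the tiled version walks over keys and values one block at a time and keeps only a running maximum, a running softmax denominator, and a running output per query.

```python
import math


def naive_attention(Q, K, V):
    """Reference: materializes the full N x N score matrix before using it."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    # Full score matrix, one row per query (this is the quadratic buffer).
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    out = []
    for row in scores:
        m = max(row)
        weights = [math.exp(s - m) for s in row]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Processes K and V in blocks; only one block of scores exists at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")  # running maximum of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(s))
            # Rescale earlier partial sums so they are relative to the new max.
            correction = math.exp(m - m_new)
            p = [math.exp(x - m_new) for x in s]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Calling both functions on the same small inputs should give matching outputs up to floating-point rounding, which is the sense in which the tiled computation is exact.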

The result is exact attention (same output as the naive algorithm, up to floating-point rounding) at 2 to 4 times the speed, with linear instead of quadratic memory in sequence length. FlashAttention-2 (2023) restructured the loop order for better GPU utilization; FlashAttention-3 (2024) added Hopper-specific tensor-core optimizations.
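A rough back-of-the-envelope calculation shows what linear versus quadratic memory means in practice. The numbers below are illustrative assumptions, not measurements: a single head and batch item, 2-byte (fp16) scores for the materialized matrix, and one 4-byte statistic per query row for the tiled approach. The function names are hypothetical, and the query, key, value, and output tensors are ignored because they are linear in sequence length in both cases.

```python
def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Bytes needed to hold the full seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_score


def row_stats_bytes(seq_len, bytes_per_stat=4):
    """Bytes needed for one softmax statistic per query row."""
    return seq_len * bytes_per_stat


for n in (8192, 16384):
    full_mib = score_matrix_bytes(n) / 2**20
    stats_kib = row_stats_bytes(n) / 2**10
    print(f"N={n}: full matrix {full_mib} MiB, per-row statistics {stats_kib} KiB")
# N=8192: full matrix 128.0 MiB, per-row statistics 32.0 KiB
# N=16384: full matrix 512.0 MiB, per-row statistics 64.0 KiB
```

Doubling the sequence length quadruples the materialized matrix but only doubles the per-row state.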

It is now the default attention kernel across the major training and inference stacks: PyTorch’s scaled_dot_product_attention, vLLM, SGLang, TensorRT-LLM, and the Hugging Face Transformers library all route attention through a FlashAttention implementation when available.
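As a hedged illustration of the PyTorch entry point, the sketch below calls torch.nn.functional.scaled_dot_product_attention and compares it with a reference that explicitly builds the attention matrix. It assumes PyTorch 2.0 or later is installed; the helper name and tensor sizes are made up for the example. Which backend PyTorch selects depends on device, dtype, shapes, and build, so on CPU this does not exercise the CUDA FlashAttention kernel; it only shows that the fused call and the materialized computation agree.

```python
def sdpa_vs_reference(seq_len=16, head_dim=8, seed=0):
    """Returns the max absolute difference between fused and reference attention."""
    # Imported inside the function so this file still loads where PyTorch is absent.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(seed)
    # Shape convention: (batch, heads, sequence, head_dim).
    q = torch.randn(1, 2, seq_len, head_dim)
    k = torch.randn(1, 2, seq_len, head_dim)
    v = torch.randn(1, 2, seq_len, head_dim)

    # Fused call; PyTorch chooses the backend for the current device and dtype.
    fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Reference that explicitly materializes the seq_len x seq_len matrix.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    reference = torch.softmax(scores, dim=-1) @ v

    return (fused - reference).abs().max().item()
```

Calling sdpa_vs_reference() should return a value close to zero, with the small residual coming from differences in floating-point evaluation order.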
