Glossary

arithmetic intensity

FLOPs performed per byte read from memory. Low intensity means an operation is memory-bound; high intensity means compute-bound. LLM decode has very low intensity.

Category: Runtime (also: Silicon). Also known as: operational intensity, ops per byte.

Arithmetic intensity is the number of floating-point operations an operation performs for each byte it moves from memory. It is the x-axis of the roofline model and the cleanest way to predict whether a workload will be limited by compute or by memory bandwidth.
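The roofline bound itself is a one-line formula: attainable throughput is the smaller of peak compute and intensity times memory bandwidth. The sketch below illustrates it; the function names and the accelerator figures (400 TFLOP/s, 2 TB/s) are hypothetical, chosen only to keep the arithmetic round.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound on throughput, in FLOP/s.

    intensity:      FLOPs performed per byte moved from memory
    peak_flops:     peak compute, in FLOP/s
    mem_bandwidth:  memory bandwidth, in bytes/s
    """
    return min(peak_flops, intensity * mem_bandwidth)


def ridge_point(peak_flops, mem_bandwidth):
    """Intensity at which the memory roof meets the compute roof."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator, not a real part.
PEAK = 400e12  # FLOP/s
BW = 2e12      # bytes/s

ridge = ridge_point(PEAK, BW)
for intensity in (1, 2, 50, 200, 1000):
    bound = attainable_flops(intensity, PEAK, BW)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOP/byte: {bound / 1e12:6.1f} TFLOP/s ({regime})")
```

Any operation whose intensity falls left of the ridge point is limited by bandwidth; anything to the right is limited by compute.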

A matrix-vector product, which is what decode does at batch size one, reads each weight once and performs roughly two operations on it (a multiply and an add). That gives an intensity of about two operations per byte with 8-bit weights, and about one with 16-bit weights. Modern accelerators can perform hundreds of operations in the time it takes to move one byte from memory, so decode leaves the compute units mostly idle and is bound by how fast weights stream from HBM or unified memory.
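The matrix-vector arithmetic can be checked with a short calculation. This is a rough sketch that counts only weight and vector traffic; the helper names, the 4096 x 4096 layer, and the 7B-parameter, 1 TB/s example are assumptions for illustration. The intensity per byte depends on weight precision: close to 2 with 1-byte weights and close to 1 with 2-byte weights.

```python
def matvec_intensity(rows, cols, bytes_per_weight, bytes_per_activation=2):
    """FLOPs per byte for y = W @ x, with W of shape (rows, cols)."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    weight_bytes = rows * cols * bytes_per_weight
    vector_bytes = (rows + cols) * bytes_per_activation  # read x, write y
    return flops / (weight_bytes + vector_bytes)


def decode_tokens_per_sec_ceiling(param_count, bytes_per_weight, mem_bandwidth):
    """Upper bound on single-stream decode speed if every weight is
    streamed from memory once per token (ignores KV cache traffic)."""
    return mem_bandwidth / (param_count * bytes_per_weight)


# Hypothetical layer size.
print(matvec_intensity(4096, 4096, bytes_per_weight=2))  # 16-bit weights
print(matvec_intensity(4096, 4096, bytes_per_weight=1))  # 8-bit weights

# Hypothetical 7B-parameter model, 16-bit weights, 1 TB/s of bandwidth.
print(decode_tokens_per_sec_ceiling(7e9, 2, 1e12))
```

The second function shows why decode throughput tracks bandwidth: the ceiling is simply bandwidth divided by the bytes of weights read per token.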

Intensity rises with batch size and during prefill, because the same weights are reused across many tokens. This is why batched serving and prompt processing can saturate compute while single-stream decode cannot. Raising intensity (through batching or speculative decoding) is the main lever for getting more tokens/sec out of a memory-bound box.
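The effect of weight reuse can be sketched by treating a batch of tokens as a matrix-matrix product against the same weights. The function name, the layer size, and the 16-bit element width are assumptions for illustration; real kernels also move KV cache and intermediate data that this model ignores.

```python
def batched_intensity(batch, rows=4096, cols=4096,
                      bytes_per_weight=2, bytes_per_activation=2):
    """FLOPs per byte for Y = W @ X, where X holds `batch` token vectors.

    The weights are read once and reused for every token in the batch,
    so FLOPs grow with batch size while weight traffic stays fixed.
    """
    flops = 2 * batch * rows * cols
    weight_bytes = rows * cols * bytes_per_weight
    activation_bytes = batch * (rows + cols) * bytes_per_activation
    return flops / (weight_bytes + activation_bytes)


# Batch 1 corresponds to single-stream decode; larger values correspond to
# batched serving or to the number of prompt tokens processed in prefill.
for batch in (1, 8, 64, 512):
    print(f"batch {batch:>4}: {batched_intensity(batch):7.1f} FLOP/byte")
```

Under these assumptions intensity grows almost linearly with batch size until activation traffic becomes significant, which is how batching moves a workload toward the compute-bound side of the roofline.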

Sources

Transformer Inference Arithmetic (kipperrii)

Mentioned in

roofline
