Glossary

roofline

A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.

Category: Runtime. Also: Silicon, Compute. Also known as: roofline model.

The roofline model plots achievable throughput against arithmetic intensity (the ratio of compute to memory traffic, measured in FLOPs per byte read from memory). An operation with low intensity is bounded by how fast memory can be read; an operation with high intensity is bounded by how fast the compute units can run. The two regions meet at a ridge point, and where a workload lands tells you which resource to optimize.
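As a rough illustration of the bound described above, the roofline can be written as the minimum of two ceilings. This is a minimal sketch, not a measurement tool: the function names and the hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical values chosen to keep the arithmetic easy.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound in FLOP/s: the lower of the compute and memory ceilings."""
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs per byte) at which the two ceilings meet."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 100e12    # 100 TFLOP/s peak compute
MEM_BANDWIDTH = 1e12   # 1 TB/s memory bandwidth

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
low = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity=1.0)      # memory-bound
high = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity=1000.0)  # compute-bound
print(ridge, low, high)
```

With these assumed numbers, any operation below 100 FLOPs per byte is limited by memory bandwidth, and anything above it is limited by peak compute.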

LLM decode sits firmly on the memory-bandwidth side of the roofline. Generating one token reads the active weights once while doing only about two floating-point operations (one multiply and one add) per weight read, so the compute units mostly wait on memory. That is why the theoretical decode tokens/sec ceiling is memory bandwidth divided by the bytes streamed per token, and why a card’s bandwidth, not its FLOPS, predicts how fast it generates text.
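The decode ceiling can be worked through numerically. The sketch below assumes the bytes streamed per token are just the active weights (parameter count times bytes per parameter) and ignores KV cache reads and other traffic, so it gives an upper bound rather than a prediction. The function name and the example figures are hypothetical.

```python
def decode_tokens_per_sec_ceiling(
    mem_bandwidth_bytes_per_s: float,
    active_params: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on decode speed: bandwidth / bytes streamed per token.

    Simplifying assumption: each token streams the active weights exactly
    once and nothing else (no KV cache or activation traffic).
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical example: 1 TB/s card, 8 billion active parameters, 16-bit weights.
ceiling = decode_tokens_per_sec_ceiling(
    mem_bandwidth_bytes_per_s=1e12,
    active_params=8e9,
    bytes_per_param=2,
)
print(ceiling)  # 62.5 tokens/sec under these assumptions
```

Halving the bytes per parameter (for example through 8-bit weights) doubles this ceiling, while adding peak FLOPS leaves it unchanged.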

Prefill is the opposite case: it processes the whole prompt in parallel, reuses each weight across many tokens, and pushes the compute units toward saturation, so it sits on the compute-bound side. A single accelerator therefore has two different effective ceilings depending on which phase it is running.
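The contrast between the two phases can be sketched by estimating arithmetic intensity from weight traffic alone. This assumes roughly two FLOPs per weight per token and counts only the bytes of the weights themselves; activations and KV cache traffic are ignored, and the hardware figures are hypothetical.

```python
def weight_intensity(tokens_in_parallel: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte when each weight read is reused across tokens.

    Assumes 2 FLOPs (multiply + add) per weight per token, weights-only traffic.
    """
    return 2.0 * tokens_in_parallel / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Return which roofline ceiling applies at the given intensity."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s bandwidth (ridge at 100 FLOPs/byte).
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

decode_intensity = weight_intensity(tokens_in_parallel=1, bytes_per_param=2)
prefill_intensity = weight_intensity(tokens_in_parallel=2048, bytes_per_param=2)

print(decode_intensity, bound_by(decode_intensity, PEAK_FLOPS, MEM_BANDWIDTH))
print(prefill_intensity, bound_by(prefill_intensity, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions a single decoded token lands far below the ridge point, while a 2048-token prompt lands well above it, which is the two-ceiling behavior described above.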
