The Open-Source AI Stack

Glossary

roofline

A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.

The roofline model plots achievable throughput against arithmetic intensity (the ratio of compute to memory traffic, measured in FLOPs performed per byte read from memory). An operation with low intensity is bounded by how fast memory can be read; an operation with high intensity is bounded by how fast the compute units can run. The two regions meet at a ridge point, and where a workload lands tells you which resource to optimize.
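The bound described above can be written as a one-line minimum. The following is a minimal sketch, not a benchmark: the accelerator figures (100 TFLOP/s peak, 1 TB/s bandwidth) are hypothetical round numbers chosen for illustration, and the function names are made up for this example.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope.

    peak_flops     -- peak compute in FLOP/s
    mem_bandwidth  -- memory bandwidth in bytes/s
    intensity      -- arithmetic intensity in FLOPs per byte
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) at which the two bounds meet."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator, round numbers for illustration only.
PEAK = 100e12  # 100 TFLOP/s
BW = 1e12      # 1 TB/s

if __name__ == "__main__":
    print("ridge point (FLOPs/byte):", ridge_point(PEAK, BW))
    for intensity in (1.0, 10.0, 100.0, 1000.0):
        print(intensity, attainable_flops(PEAK, BW, intensity))
```

Any workload whose intensity falls below the ridge point is limited by the bandwidth term; above it, the compute roof applies.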

LLM decode, the second phase of inference that generates one token at a time from the KV cache, sits firmly on the memory-bandwidth side of the roofline. Generating one token reads the active weights once while doing only about two floating-point operations per weight read (one or two per byte at 16-bit or 8-bit precision), so the compute units mostly wait on memory. That is why the theoretical decode tokens/sec ceiling is memory bandwidth (the rate, in GB/s or TB/s, at which the accelerator reads its memory) divided by the bytes streamed per token, and why a card’s bandwidth, not its FLOPS, predicts how fast it generates text.
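The decode ceiling is simple enough to compute by hand. Here is a rough sketch under stated assumptions: a single request, only weight traffic counted (KV cache and activation reads are ignored), and a hypothetical 7-billion-parameter dense model in 16-bit weights on a hypothetical 1 TB/s card. The function name is illustrative.

```python
def decode_tokens_per_sec_ceiling(
    mem_bandwidth_bytes_per_s: float,
    active_params: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on decode speed: bandwidth / bytes streamed per token.

    Assumes each token streams every active weight exactly once and
    ignores KV cache and activation traffic, so real throughput is lower.
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # Hypothetical numbers: 1 TB/s card, 7e9 active parameters, 2 bytes each.
    ceiling = decode_tokens_per_sec_ceiling(1e12, 7e9, 2.0)
    print(f"decode ceiling: {ceiling:.1f} tokens/sec")
```

Halving the bytes per parameter (for example, 8-bit instead of 16-bit weights) doubles this ceiling, while raising peak FLOPS leaves it unchanged.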

Prefill, the first phase of inference that processes the input prompt and builds the initial KV cache, is the opposite case: it processes the whole prompt in parallel, reuses each weight across many tokens, and pushes the compute units toward saturation, so it sits on the compute-bound side. A single accelerator therefore has two different effective ceilings depending on which phase it is running.
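To see why the two phases land on opposite sides of the ridge, one can estimate intensity from how many tokens share each weight read. This is a simplified sketch that counts only weight traffic and two FLOPs (one multiply-add) per weight per token; the accelerator figures and the 2048-token prompt are hypothetical, and the helper names are invented for this example.

```python
def weight_intensity(tokens_per_weight_read: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte when each weight read serves several tokens.

    Each weight contributes one multiply-add (2 FLOPs) per token. Activation
    and KV cache traffic are ignored, so this overstates real intensity.
    """
    return 2.0 * tokens_per_weight_read / bytes_per_param


def classify(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity with the ridge point to pick the limiting resource."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s bandwidth (ridge = 100).
PEAK = 100e12
BW = 1e12

if __name__ == "__main__":
    decode = weight_intensity(1, 2.0)      # one token per weight read
    prefill = weight_intensity(2048, 2.0)  # 2048-token prompt in parallel
    print("decode :", decode, classify(decode, PEAK, BW))
    print("prefill:", prefill, classify(prefill, PEAK, BW))
```

Under these assumptions the same hardware is limited by bandwidth during decode and by compute during prefill, which is the two-ceiling behavior described above.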
