Glossary

model FLOPs utilization

MFU is the fraction of an accelerator's peak compute a workload actually achieves. It is the compute-bound analogue of MBU, relevant to prefill and training rather than to memory-bound decode.

Category: Runtime; also Silicon, Training. Also known as: MFU

Model FLOPs utilization (MFU) is the ratio of the compute a workload actually sustains to the hardware’s peak FLOPS. It is the compute-bound counterpart to model bandwidth utilization (MBU), the fraction of an accelerator’s peak memory bandwidth a serving stack actually reaches during decode, which in real systems lands around 60 to 85 percent. MFU governs the phases of LLM work that saturate the matrix-multiply units rather than the memory bus.
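As a minimal sketch of the ratio defined above, the helper below divides sustained compute by peak compute. The function name and the example figures are hypothetical and chosen only for illustration; they are not measurements from any specific chip.

```python
def model_flops_utilization(achieved_flops_per_s: float,
                            peak_flops_per_s: float) -> float:
    """Return MFU as a fraction: sustained compute divided by peak compute."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical figures: 400 TFLOP/s sustained on a chip rated at
# 1000 TFLOP/s dense.
example_mfu = model_flops_utilization(400e12, 1000e12)
print(f"MFU = {example_mfu:.0%}")
```

The achieved figure is usually derived from the model's known FLOP count and the measured wall-clock time, not read from a hardware counter.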

Prefill, the first phase of LLM inference, processes the input prompt and builds the initial KV cache, working in parallel across prompt tokens. Prefill and training are compute-bound, so MFU is the right efficiency metric for them. A rough prefill time-to-first-token estimate divides the prompt’s compute (about two floating-point operations per parameter per token) by the chip’s dense FLOPS times an MFU factor. Real MFU varies more than MBU because it is sensitive to sequence length, kernel fusion, and how well the batch packs the compute units.
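The time-to-first-token estimate described above can be sketched as follows. This is a back-of-envelope calculation under stated assumptions: it counts only the two operations per parameter per token, ignores attention's sequence-length-dependent cost, and uses a hypothetical function name and made-up inputs.

```python
def estimate_prefill_ttft_s(n_params: float,
                            prompt_tokens: int,
                            peak_dense_flops_per_s: float,
                            mfu: float) -> float:
    """Rough prefill time in seconds: prompt FLOPs / (peak FLOPS * MFU)."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    # About two floating-point operations per parameter per prompt token.
    prompt_flops = 2.0 * n_params * prompt_tokens
    return prompt_flops / (peak_dense_flops_per_s * mfu)


# Hypothetical inputs: an 8B-parameter model, a 2000-token prompt,
# a chip rated at 1e15 dense FLOPS, and an assumed MFU of 0.5.
ttft_s = estimate_prefill_ttft_s(8e9, 2000, 1e15, 0.5)
print(f"estimated prefill time: {ttft_s * 1e3:.0f} ms")
```

Because the result scales inversely with the assumed MFU, sweeping that factor across a plausible range gives a more honest estimate than a single number.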

MFU does not describe single-stream decode, the second phase of LLM inference, which generates one token at a time from the KV cache. Decode is memory-bound: its throughput tracks memory bandwidth more than peak compute, and it stays well below peak compute no matter how good the kernels are. Confusing the two is a common source of wrong hardware conclusions: a chip with enormous FLOPS but modest bandwidth will look great on a compute-bound benchmark and disappoint on interactive token generation.
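To illustrate why decode stays far below peak compute, the sketch below compares the time to read the weights once per token with the time to do that token's arithmetic. It assumes batch size 1, that every weight is read once per generated token, and that KV cache traffic is ignored. The function name and hardware figures are hypothetical.

```python
def decode_compute_ceiling(n_params: float,
                           bytes_per_param: float,
                           peak_flops_per_s: float,
                           mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on compute utilization for single-stream decode.

    Returns compute time divided by memory time per token, capped at 1.0.
    """
    compute_time_s = 2.0 * n_params / peak_flops_per_s
    memory_time_s = n_params * bytes_per_param / mem_bandwidth_bytes_per_s
    return min(1.0, compute_time_s / memory_time_s)


# Hypothetical chip: 1e15 dense FLOPS and 2e12 bytes/s of memory bandwidth,
# running an 8B-parameter model stored at 2 bytes per parameter.
ceiling = decode_compute_ceiling(8e9, 2.0, 1e15, 2e12)
print(f"compute utilization ceiling in decode: {ceiling:.2%}")
```

Under these assumptions the parameter count cancels out, so the ceiling depends only on the chip's ratio of bandwidth to FLOPS and on the weight precision.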

Sources

LLM Inference Performance Engineering: Best Practices (Databricks)

Mentioned in

model bandwidth utilization
