Glossary
model bandwidth utilization
MBU is the fraction of an accelerator's peak memory bandwidth a serving stack actually reaches during decode. Real systems land around 60 to 85 percent.
Model bandwidth utilization is the ratio of the memory bandwidth a deployment actually achieves to the hardware’s peak. It is the honest derating between the theoretical decode roofline ceiling and what a real runtime delivers. A box rated at 1.8 TB/s does not turn all of it into tokens; scheduler overhead, dequantization, kernel launch latency, and imperfect memory access patterns each take a cut.
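The derating can be sketched numerically. The snippet below is a minimal illustration, not this site's calculator. It assumes batch-size-1 decode where each generated token streams the full set of weights from memory once, and it ignores KV cache traffic. The 1.8 TB/s figure and the 60 to 85 percent band come from this entry; the 16 GB weight footprint and the function names are hypothetical.

```python
def decode_ceiling_tokens_per_s(peak_bw_bytes_per_s: float, bytes_per_token: float) -> float:
    """Roofline ceiling for memory-bound decode.

    Assumes every generated token requires reading `bytes_per_token`
    bytes from memory exactly once (weights only, KV cache ignored).
    """
    if peak_bw_bytes_per_s <= 0 or bytes_per_token <= 0:
        raise ValueError("bandwidth and bytes per token must be positive")
    return peak_bw_bytes_per_s / bytes_per_token


def derated_tokens_per_s(peak_bw_bytes_per_s: float, bytes_per_token: float, mbu: float) -> float:
    """Scale the roofline ceiling by an MBU fraction in (0, 1]."""
    if not 0.0 < mbu <= 1.0:
        raise ValueError("mbu must be a fraction in (0, 1]")
    return mbu * decode_ceiling_tokens_per_s(peak_bw_bytes_per_s, bytes_per_token)


PEAK_BW = 1.8e12       # 1.8 TB/s, the example rating used in the text
WEIGHT_BYTES = 16e9    # hypothetical: about 8B parameters at 2 bytes each

ceiling = decode_ceiling_tokens_per_s(PEAK_BW, WEIGHT_BYTES)
low = derated_tokens_per_s(PEAK_BW, WEIGHT_BYTES, 0.60)   # bottom of the band
high = derated_tokens_per_s(PEAK_BW, WEIGHT_BYTES, 0.85)  # top of the band

print(f"ceiling: {ceiling:.1f} tok/s, realistic range: {low:.1f} to {high:.1f} tok/s")
```

Under these assumptions the ceiling is 112.5 tokens/s and the realistic range is roughly 68 to 96 tokens/s. Larger batches and long contexts change the bytes moved per token, so the weights-only term is a simplification.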
Well-optimized CUDA stacks (vLLM, TensorRT-LLM) commonly run in the 70 to 85 percent range; portable runtimes such as llama.cpp (which loads GGUF weights) and Apple’s MLX run lower, and AI-PC parts lower still. The exact band is a rule of thumb rather than a published constant, which is why this site shows a realistic range and overlays measured numbers rather than a single false-precise figure.
MBU is the bandwidth-side analogue of model FLOPs utilization, which plays the same role for compute-bound work. Tracking MBU is the right way to compare serving stacks on the same hardware: a higher MBU at equal peak bandwidth means more tokens/sec for the same box.
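Going the other way, MBU can be estimated from a measured decode rate. The sketch below assumes achieved bandwidth is approximately tokens/sec times the bytes read per token (weights only, batch size 1). The stack names and throughput figures are placeholders for illustration, not benchmark results, and the 16 GB weight footprint is hypothetical.

```python
def measured_mbu(tokens_per_s: float, bytes_per_token: float, peak_bw_bytes_per_s: float) -> float:
    """Estimate MBU as achieved bandwidth divided by peak bandwidth.

    Achieved bandwidth is approximated as tokens/s * bytes read per token.
    """
    if peak_bw_bytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    achieved_bw = tokens_per_s * bytes_per_token
    return achieved_bw / peak_bw_bytes_per_s


PEAK_BW = 1.8e12       # same box for both stacks: 1.8 TB/s
WEIGHT_BYTES = 16e9    # hypothetical model footprint in bytes

# Placeholder decode rates in tokens/s; substitute your own measurements.
measured_tokens_per_s = {"stack_a": 90.0, "stack_b": 72.0}

mbu_by_stack = {
    name: measured_mbu(rate, WEIGHT_BYTES, PEAK_BW)
    for name, rate in measured_tokens_per_s.items()
}
best = max(mbu_by_stack, key=mbu_by_stack.get)

for name, mbu in mbu_by_stack.items():
    print(f"{name}: MBU = {mbu:.0%}")
print(f"highest MBU on this box: {best}")
```

Because peak bandwidth and model size are held fixed, ranking by MBU here is the same as ranking by tokens/sec; the ratio mainly makes results comparable across boxes with different peak bandwidth.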