Glossary

tokens per second

The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.

Category: Runtime (also: Silicon, Evaluation). Also known as: tok/s, tps, tokens/sec

Tokens per second is the speed at which a model produces output. The number people care about for interactive use is decode tokens/sec: the rate at which new tokens stream after the first one appears. It is bounded by memory bandwidth divided by the bytes streamed per token, which makes it a property of the model, the quantization, the context length, and the hardware together, not of any one alone.
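The bound can be written out as arithmetic. The sketch below is a minimal illustration, not this site's calculator: the function name, its parameters, and the example numbers are all hypothetical, and it assumes the bytes streamed per token are the active weights plus whatever KV cache is read at the current context length.

```python
def decode_ceiling_tok_s(
    bandwidth_gb_s: float,
    active_params_billions: float,
    bits_per_weight: float,
    kv_cache_gb: float = 0.0,
) -> float:
    """Upper bound on single-stream decode tokens/sec (illustrative only).

    Assumption: each decoded token streams the active weights once, plus
    the KV cache that has accumulated at the current context length.
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    weights_gb = active_params_billions * bits_per_weight / 8
    gb_per_token = weights_gb + kv_cache_gb
    if gb_per_token <= 0:
        raise ValueError("bytes streamed per token must be positive")
    return bandwidth_gb_s / gb_per_token


# Hypothetical hardware and model: 400 GB/s, 8B active parameters, 4-bit weights.
short_context = decode_ceiling_tok_s(400, 8, 4)                  # 4 GB per token
long_context = decode_ceiling_tok_s(400, 8, 4, kv_cache_gb=1.0)  # 5 GB per token
print(short_context, long_context)
```

With these made-up inputs the ceiling falls from 100 to 80 tokens/sec as the KV cache grows, which is why the same model on the same hardware does not have a single decode speed.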

A single tokens/sec figure on a screenshot is rarely comparable, because it hides the workload. Single-stream decode (one user) is different from batched server throughput (many concurrent users summed), often by more than ten times; prefill and decode are different again; and different tokenizers cut the same text into different token counts. A useful number records the model, quant, runtime, context, and concurrency.
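One way to keep a figure attached to its workload is to store the fields listed above next to the number. The record below is a hypothetical sketch: the class name, field names, and example values are assumptions made for illustration, and the per-stream figure is a simple average across concurrent streams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokPerSecRecord:
    """A tokens/sec figure together with the workload it was measured under."""

    model: str
    quant: str
    runtime: str
    context_tokens: int
    concurrency: int        # 1 means single-stream decode
    aggregate_tok_s: float  # summed across all concurrent streams

    @property
    def per_stream_tok_s(self) -> float:
        # Average rate seen by one user; equals the aggregate when concurrency is 1.
        return self.aggregate_tok_s / self.concurrency

    def comparable_to(self, other: "TokPerSecRecord") -> bool:
        # Two figures are only directly comparable when the workload matches.
        return (
            self.model, self.quant, self.runtime,
            self.context_tokens, self.concurrency,
        ) == (
            other.model, other.quant, other.runtime,
            other.context_tokens, other.concurrency,
        )


# Placeholder values, not measurements.
single = TokPerSecRecord("example-8b", "q4", "runtime-a", 4096, 1, 50.0)
batched = TokPerSecRecord("example-8b", "q4", "runtime-a", 4096, 16, 640.0)
print(single.comparable_to(batched), batched.per_stream_tok_s)
```

In this made-up pair the batched aggregate is more than ten times the single-stream number even though each user sees a slower stream, which is the gap a bare screenshot hides.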

This site computes a theoretical ceiling and a realistic range for decode, and overlays measured anchors where a sourced single-stream number exists, so the reader can compare the estimate against reality rather than trust one figure out of context.
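The relationship between ceiling, range, and anchor can be sketched as follows. This is not the site's actual method: the function names are hypothetical, and the efficiency fractions (0.5 to 0.8 of the ceiling) are placeholder assumptions chosen only to make the example concrete.

```python
def decode_estimate(bandwidth_gb_s, gb_streamed_per_token, efficiency=(0.5, 0.8)):
    """Return a theoretical ceiling and an assumed realistic range, in tokens/sec.

    The efficiency fractions are placeholders, not sourced figures.
    """
    if gb_streamed_per_token <= 0:
        raise ValueError("bytes streamed per token must be positive")
    ceiling = bandwidth_gb_s / gb_streamed_per_token
    low, high = efficiency
    return {"ceiling": ceiling, "realistic": (ceiling * low, ceiling * high)}


def place_anchor(estimate, measured_tok_s):
    """Describe where a measured single-stream number falls against the estimate."""
    low, high = estimate["realistic"]
    if measured_tok_s > estimate["ceiling"]:
        # A single-stream figure above the bandwidth bound suggests mismatched inputs.
        return "above ceiling: check the inputs"
    if measured_tok_s > high:
        return "above realistic range"
    if measured_tok_s < low:
        return "below realistic range"
    return "within realistic range"


# Hypothetical inputs: 400 GB/s, 4 GB streamed per token, a measured 62 tok/s.
estimate = decode_estimate(400.0, 4.0)
print(estimate["ceiling"], place_anchor(estimate, 62.0))
```

A measured anchor above the ceiling is the useful failure case: it usually means the quoted number was batched, or that the model, quant, or context assumed in the estimate does not match the one measured.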

Sources

Memory Bandwidth for Local AI Hardware (2026 Edition), Ahmad Osman
