The Open-Source AI Stack


tokens per second

The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.

Tokens per second is the speed at which a model produces output. The number people care about for interactive use is decode tokens/sec: the rate at which new tokens stream after the first one appears. Because each token streams the active weights from memory once, decode speed is bounded by memory bandwidth (the rate, in GB/s or TB/s, at which an accelerator reads its memory) divided by the bytes streamed per token. That makes it a property of the model, the quantization, the context length, and the hardware together, not of any one alone.
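The bound above can be sketched as a short calculation. This is a minimal illustration, not this site's actual estimator: the function name, its parameters, and the example figures (a 1000 GB/s accelerator, an 8-billion-parameter dense model) are assumptions chosen for round arithmetic. The KV-cache term stands in for the context-length dependence.

```python
def decode_ceiling_tokens_per_sec(
    bandwidth_gb_s: float,
    active_params_billions: float,
    bytes_per_param: float,
    kv_cache_gb: float = 0.0,
) -> float:
    """Theoretical upper bound on single-stream decode speed.

    Assumes every generated token streams the active weights once,
    plus whatever KV cache has accumulated for the current context.
    """
    # Billions of parameters times bytes per parameter gives GB per token.
    gb_streamed_per_token = active_params_billions * bytes_per_param + kv_cache_gb
    return bandwidth_gb_s / gb_streamed_per_token


# Hypothetical hardware and model, for illustration only.
fp16 = decode_ceiling_tokens_per_sec(1000, 8, 2.0)      # 16 GB per token
int4 = decode_ceiling_tokens_per_sec(1000, 8, 0.5)      # 4 GB per token
long_ctx = decode_ceiling_tokens_per_sec(1000, 8, 2.0, kv_cache_gb=4.0)

print(fp16, int4, long_ctx)  # 62.5 250.0 50.0
```

The same hardware yields three different ceilings depending on quantization and context, which is why the figure cannot be attributed to the accelerator or the model alone. Measured decode speed will sit below these ceilings.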

A single tokens/sec figure on a screenshot is rarely comparable with another, because it hides the workload. Single-stream decode (one user) differs from batched server throughput (many concurrent users summed), often by more than ten times; prefill, the compute-bound phase that processes the input prompt, runs at a different rate again; and different tokenizers cut the same text into different token counts. A useful number records the model, quant, runtime, context, and concurrency.
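One way to keep a figure attached to its workload is to store it alongside the fields listed above. The sketch below is hypothetical: the `DecodeMeasurement` class, its field names, and the example values are illustrative assumptions, not a schema this site uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeMeasurement:
    """A tokens/sec figure together with the workload that produced it."""
    model: str
    quant: str
    runtime: str
    context_tokens: int
    concurrency: int                  # 1 means single-stream
    aggregate_tokens_per_sec: float   # summed over all concurrent streams

    @property
    def per_stream_tokens_per_sec(self) -> float:
        # What one user sees: the aggregate split across concurrent streams.
        return self.aggregate_tokens_per_sec / self.concurrency


def comparable(a: DecodeMeasurement, b: DecodeMeasurement) -> bool:
    """Two figures are comparable only if the whole workload matches."""
    return (a.model, a.quant, a.runtime, a.context_tokens, a.concurrency) == (
        b.model, b.quant, b.runtime, b.context_tokens, b.concurrency
    )


# Made-up numbers for illustration only.
single = DecodeMeasurement("example-8b", "int4", "runtime-x", 4096, 1, 50.0)
batched = DecodeMeasurement("example-8b", "int4", "runtime-x", 4096, 32, 800.0)

print(comparable(single, batched))        # False
print(batched.per_stream_tokens_per_sec)  # 25.0
```

In this made-up case the batched server reports sixteen times the aggregate figure while each user sees half the single-stream speed, so quoting either number without its concurrency would mislead.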

This site computes a theoretical ceiling and a realistic range for decode, and overlays measured anchors where a sourced single-stream number exists, so the reader can see estimate against reality rather than trusting one figure out of context.

