The Open-Source AI Stack
RSS

Glossary

throughput

The rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics.

The aggregate rate of token output across all concurrent requests on a given hardware unit. A vLLMruntimeAn open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load. Open full entry server might do 5000 tokens/sec on an H100 at 70B LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry with FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry weights; an MI300X might do 4000 tokens/sec on the same model; an OllamaruntimeA local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine. Open full entry instance on a Mac Studio does 50 tokens/ sec for a single user.

Throughput trades against latencycomputeThe time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric. Open full entry . Larger batches push throughput up at the cost of higher time-to-first-token and time-per-token for the individual request. Production serving usually picks a target latencycomputeThe time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric. Open full entry budget and then tunes batchingcomputeGrouping multiple requests or training examples into a single forward or backward pass, the lever that turns GPU compute density into throughput. Open full entry , quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry , and prefix cachingruntimeA serving optimization that stores the KV cache for shared prompt prefixes (system prompts, few-shot examples) so subsequent requests reusing them skip the prefill compute. Open full entry to maximize throughput under that budget.

The relevant denominator depends on the question. Per-GPUsiliconA massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks. Open full entry throughput is the unit of hardware planning. Aggregate cluster throughput is the unit of capacity planning. Tokens per dollar is the unit of business planning. The same model can rank differently on each.

Sources

Mentioned in

Back to glossary