Glossary
continuous batching
A request-scheduling pattern in which the inference engine re-forms the running batch at every token step, admitting a new request as soon as any running request finishes, instead of waiting for the whole batch to complete.
The serving pattern that made high-throughput LLM inference economically viable. Naive batching waits for every request in a batch to finish before starting the next, which is wasteful: a 500-token completion holds the batch open while the slot of a finished 20-token completion sits idle. Continuous batching schedules at the token level, swapping new requests in as others finish, so the GPU is not left idle waiting on the longest generation.
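The difference can be made concrete with a toy step counter. This is a minimal sketch rather than any engine's scheduler: it assumes every active request emits exactly one token per decode step, ignores prefill cost and memory limits, and uses hypothetical function names chosen for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until the longest completion is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step.

    Assumes every entry in `lengths` is a positive token count.
    """
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: each active request emits one token.
        running = [r - 1 for r in running]
        # Finished requests leave the batch immediately.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [500, 20, 500, 20]
print(static_batching_steps(lengths, batch_size=2))      # 1000
print(continuous_batching_steps(lengths, batch_size=2))  # 520
```

With two slots, the static schedule pays for two full 500-step batches, while the continuous schedule starts the second long request at step 20, as soon as the first short one finishes.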
Combined with PagedAttention (so per-request KV cache can grow independently) and prefill chunking (so a new request’s prefill does not stall the decode loop), continuous batching pushes a server’s effective throughput per GPU 5 to 20 times higher than naive batching at similar latency targets.
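One way to picture prefill chunking is as a per-step token budget shared between decoding requests and a pending prompt. The sketch below is an illustration under stated assumptions, not the scheduler of any particular engine: `plan_step`, `steps_to_prefill`, and `token_budget` are hypothetical names, decodes are assumed to cost one token each, and only a single pending prefill is modeled.

```python
def plan_step(num_decoding, prefill_remaining, token_budget):
    """Return (decode_tokens, prefill_chunk) for one engine step.

    Decoding requests are served first, one token each; whatever budget
    is left goes to the next chunk of the pending prefill.
    """
    decode_tokens = min(num_decoding, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk


def steps_to_prefill(prompt_len, num_decoding, token_budget):
    """Count engine steps needed to finish a prompt's prefill in chunks."""
    remaining = prompt_len
    steps = 0
    while remaining > 0:
        _, chunk = plan_step(num_decoding, remaining, token_budget)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        remaining -= chunk
        steps += 1
    return steps


print(plan_step(num_decoding=8, prefill_remaining=2000, token_budget=512))
print(steps_to_prefill(prompt_len=2000, num_decoding=8, token_budget=512))
```

In this toy setting a 2000-token prompt is absorbed over several steps, and the eight decoding requests still emit a token on each of them instead of pausing for one long prefill pass.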
It is the default scheduling mode in vLLM, SGLang, TensorRT-LLM, TGI, and llama.cpp’s server mode.