Glossary

latency

The time from request submission to response completion, the user-facing speed metric. For LLMs it breaks down into time-to-first-token and time-per-output-token.

Category: Compute. Also: Runtime, Evaluation. Aka: time to first token, TTFT, TPOT.

The user-perceived wait. For LLMs the standard breakdown is time-to-first-token (TTFT, dominated by prefill) and time-per-output-token (TPOT, dominated by decode). End-to-end latency is approximately TTFT plus TPOT times the number of output tokens.
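The breakdown above can be sketched as a small calculation. The function name and the example numbers (400 ms TTFT, 30 ms TPOT, 200 output tokens) are illustrative assumptions, not measurements from any system.

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency as TTFT + TPOT * output length."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_ms + tpot_ms * output_tokens


# Hypothetical request: 400 ms to first token, 30 ms per token, 200 tokens out.
total_ms = end_to_end_latency_ms(ttft_ms=400, tpot_ms=30, output_tokens=200)
print(total_ms)  # 6400
```

With these assumed numbers the decode phase accounts for 6000 of the 6400 ms, which is why long outputs are usually limited by TPOT rather than TTFT.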

Latency budgets shape architecture choices. A chat UI needs TTFT under half a second to feel responsive; an agent loop tolerates seconds. Embedded code completion needs the full end-to-end response in under 100 ms; a long-form research agent can run for minutes.
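One way to see how a budget constrains a design is to ask how many output tokens fit inside it. The sketch below assumes the simple TTFT plus TPOT times output length estimate; the function name and the 40 ms / 10 ms figures are hypothetical.

```python
import math


def max_output_tokens(budget_ms: float, ttft_ms: float, tpot_ms: float) -> int:
    """Largest output length whose estimated end-to-end latency fits the budget."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    if ttft_ms > budget_ms:
        return 0  # the first token alone already misses the budget
    return math.floor((budget_ms - ttft_ms) / tpot_ms)


# Hypothetical code-completion setup: 100 ms end-to-end budget,
# 40 ms TTFT, 10 ms per output token.
completion_tokens = max_output_tokens(budget_ms=100, ttft_ms=40, tpot_ms=10)
print(completion_tokens)  # 6
```

Under these assumed numbers a sub-100 ms completion can return only a handful of tokens, which suggests why such features tend to pair small models with short outputs.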

The tunable levers are prefill chunking (bounds the prefill work done per scheduling step regardless of prompt length), prefix caching (reuses prior prefill state), speculative decoding (accelerates TPOT), model size (smaller models are faster end-to-end), and quantization (smaller weights move faster). Production serving usually defines latency SLOs first and then optimizes throughput under those bounds.
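As an illustration of one lever, the toy model below counts how many prompt tokens still need prefill once shared prefixes are cached. It is a sketch under simplifying assumptions: `ToyPrefixCache` and `block_size` are hypothetical names, token IDs stand in for real KV state, and production servers use their own block hashing and eviction policies.

```python
from typing import Sequence


class ToyPrefixCache:
    """Tracks which block-aligned token prefixes already have cached state."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._seen = set()  # set of token tuples, one per cached prefix

    def prefill_cost(self, tokens: Sequence[int]) -> int:
        """Return how many tokens still need prefill, then cache this prompt."""
        toks = tuple(tokens)
        boundaries = range(self.block_size, len(toks) + 1, self.block_size)
        cached = 0
        for end in boundaries:
            if toks[:end] in self._seen:
                cached = end  # this whole prefix can be reused
            else:
                break
        for end in boundaries:
            self._seen.add(toks[:end])
        return len(toks) - cached


cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(8))  # stand-in for a shared system prompt
first = cache.prefill_cost(system_prompt + [100, 101, 102])
second = cache.prefill_cost(system_prompt + [200, 201])
print(first, second)  # 11 2
```

The second request reuses the eight shared tokens and prefills only its two new ones, which is the mechanism by which prefix caching shortens TTFT for prompts that share a prefix.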

Sources

Anyscale: LLM inference performance engineering
