Glossary

speculative decoding

An inference acceleration technique in which a small, fast draft model proposes several tokens ahead and the larger target model verifies them all in a single parallel forward pass, typically giving a 1.5-3x speedup with no quality loss.

Category: Runtime. Also known as: speculative sampling.

A two-model technique for accelerating autoregressive generation without changing the output distribution. A small “draft” model produces a short run of candidate tokens, greedily or by sampling. The large “target” model then scores that whole candidate sequence in one forward pass. Each candidate is accepted with a probability that depends on how the target’s probability for that token compares with the draft’s; at the first rejection, the remaining candidates are discarded and a replacement token is drawn from a corrected target distribution. Because the target’s expensive forward pass produced a full distribution at every position, decoding advances by the number of accepted tokens plus one, rather than by a single token per pass.
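The accept/reject rule described above can be sketched in a few lines. This is a minimal illustration that assumes the standard speculative sampling scheme (accept with probability min(1, p/q), resample from the normalized residual max(p - q, 0) on rejection). The function name `speculative_step` and the list-of-lists representation of the distributions are hypothetical choices for this example, not the API of any particular engine.

```python
import random

def speculative_step(draft_probs, target_probs, draft_tokens, rng=random):
    """Verify one batch of draft tokens and return the tokens to emit.

    draft_probs[i]  : draft distribution q_i from which draft_tokens[i] was sampled
    target_probs[i] : target distribution p_i at the same position; it has one
                      extra entry for the position after the last draft token
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: draw a replacement from the residual max(p - q, 0).
        # random.choices normalizes the weights, so no explicit division is needed.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted  # everything after the rejected position is discarded
    # All draft tokens accepted: the same target pass also gives one bonus token.
    last = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

Every call emits at least one token, so even a badly mismatched draft model never produces fewer tokens per target pass than ordinary decoding; it only adds the cost of running the draft.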

The speedup depends on draft-target agreement and runs roughly 1.5x to 3x in practice. The output is provably identical (in distribution) to sampling from the target alone, so it is a pure acceleration with no quality loss.
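To make the dependence on agreement concrete, here is a rough back-of-the-envelope estimate. It assumes a simplified model in which each draft token is accepted independently with the same probability `alpha`, the draft proposes `gamma` tokens per step, and one draft forward pass costs `cost_ratio` times one target pass. These parameter names and the independence assumption are illustrative; real acceptance rates vary by position and prompt.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target pass under i.i.d. acceptance.

    Geometric sum 1 + alpha + ... + alpha**gamma, counting the replacement
    or bonus token that every step produces.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, gamma, cost_ratio):
    """Tokens per step divided by the step cost in units of one target pass."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)

# Example: 80% acceptance, 4 draft tokens, draft pass at 5% of the target's cost.
print(round(estimated_speedup(0.8, 4, 0.05), 2))  # prints 2.8
```

Under this model, a longer draft run only pays off when acceptance is high: with low `alpha` the extra draft passes add cost without adding accepted tokens.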

Variants: Medusa adds multiple prediction heads to the target itself instead of a separate draft model; EAGLE uses an autoregressive draft head conditioned on the target’s hidden states; Lookahead Decoding eliminates the draft model entirely. vLLM, SGLang, and TensorRT-LLM all support some form of speculative decoding as of 2026.
