The Open-Source AI Stack


RadixAttention

A KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state.

Category: Runtime. Also known as: radix attention.

RadixAttention is the KV cache scheme that distinguishes SGLang from peer runtimes. Where PagedAttention treats the cache as a flat pool of fixed-size blocks, RadixAttention organizes blocks into a radix tree keyed by token prefix. Requests sharing a prefix automatically share the corresponding cache nodes, and least-recently-used eviction reclaims unreferenced branches.
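The prefix-keyed tree and its eviction rule can be illustrated with a toy model. The sketch below is not SGLang's implementation: it uses an uncompressed trie with one node per token (a real radix tree merges runs of tokens into single edges), stores no actual key/value tensors, and every name in it (`PrefixCache`, `match_prefix`, `insert`, `evict`, `ref_count`) is hypothetical and chosen only for illustration.

```python
import itertools
from typing import Dict, Iterator, List, Optional

_clock = itertools.count()  # logical clock used for LRU ordering


class Node:
    """One cached token position (hypothetical; stands in for a KV block)."""

    def __init__(self, parent: Optional["Node"] = None, token: Optional[int] = None):
        self.parent = parent
        self.token = token
        self.children: Dict[int, "Node"] = {}
        self.ref_count = 0              # running requests that pin this node
        self.last_used = next(_clock)   # updated on every prefix match


class PrefixCache:
    def __init__(self) -> None:
        self.root = Node()
        self.size = 0  # number of cached token positions

    def match_prefix(self, tokens: List[int]) -> List[Node]:
        """Return the nodes along the longest cached prefix of `tokens`."""
        node, path = self.root, []
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            child.last_used = next(_clock)
            path.append(child)
            node = child
        return path

    def insert(self, tokens: List[int]) -> int:
        """Cache `tokens`; return how many leading tokens were already cached."""
        path = self.match_prefix(tokens)
        node = path[-1] if path else self.root
        for t in tokens[len(path):]:
            child = Node(parent=node, token=t)
            node.children[t] = child
            node = child
            self.size += 1
        return len(path)

    def evict(self, n: int) -> int:
        """Evict up to `n` token positions, least recently used leaf first.

        Only unpinned leaves are candidates, so a shared prefix is never
        removed while a longer cached branch still hangs off it.
        """
        evicted = 0
        while evicted < n:
            leaves = [x for x in self._nodes()
                      if not x.children and x.ref_count == 0]
            if not leaves:
                break
            victim = min(leaves, key=lambda x: x.last_used)
            del victim.parent.children[victim.token]
            self.size -= 1
            evicted += 1
        return evicted

    def _nodes(self) -> Iterator[Node]:
        stack = list(self.root.children.values())
        while stack:
            x = stack.pop()
            yield x
            stack.extend(x.children.values())


# Two requests share a 4-token "system prompt"; the second reuses it.
cache = PrefixCache()
system_prompt = [1, 2, 3, 4]
reused_first = cache.insert(system_prompt + [10, 11])   # nothing cached yet
reused_second = cache.insert(system_prompt + [20])      # shares the prefix
print(reused_first, reused_second, cache.size)
```

In this toy version a request would increment `ref_count` on the last node of its matched path while running and decrement it when done. Rescanning the whole tree on each eviction is deliberately naive; a production cache would keep a heap or list of evictable leaves instead.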

The use case is agent-shaped traffic: many requests reusing the same long system prompt, the same few-shot examples, or the same tool-call context. Prefix sharing turns these into near-free reuses of already-computed attention state, with throughput gains of 2x to 5x for prefix-heavy workloads compared to no sharing.
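A rough way to see why prefix-heavy traffic benefits is to count the prompt tokens that must be processed with and without sharing. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions: every request has the same shared prefix length and the same unique suffix length, and the cache never evicts the prefix. The function name and the workload numbers are hypothetical. It counts prefill work only, so the ratio it prints is not a throughput prediction and does not derive the 2x to 5x figure above.

```python
def prefill_tokens(num_requests: int, prefix_len: int, suffix_len: int,
                   share_prefix: bool) -> int:
    """Prompt tokens processed across all requests (hypothetical model)."""
    if share_prefix:
        # The shared prefix is computed once; each request adds its own suffix.
        return prefix_len + num_requests * suffix_len
    # Without sharing, every request recomputes the full prompt.
    return num_requests * (prefix_len + suffix_len)


# Illustrative agent-style workload: long shared context, short unique turns.
without = prefill_tokens(100, prefix_len=2000, suffix_len=200, share_prefix=False)
shared = prefill_tokens(100, prefix_len=2000, suffix_len=200, share_prefix=True)
print(without, shared, without / shared)
```

The ratio shrinks as the unique suffix grows relative to the shared prefix, which matches the intuition that short-prompt, single-turn traffic has little to share.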

SGLang pairs RadixAttention with a domain-specific frontend for structured output (constrained decoding, parallel forking), and that combination is what it markets. For pure single-turn chat traffic the gains are smaller; for agent traffic, prefix reuse can be the dominant performance lever.
