Glossary
SGLang
An open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads.
A peer-competitive open inference engine launched in 2024 by the LMSYS
team (the same group behind Chatbot Arena and Vicuna). The defining
technical features are RadixAttention for prefix-sharing across
requests and a Python frontend for structured generation
(constrained decoding, parallel forking, control-flow over multiple
generations).
The combination matches agent-shaped traffic well: many requests reusing long system prompts and tool descriptions, with shared cache state across requests. On agent benchmarks SGLang typically matches or beats vLLM throughput; on pure single-turn chat the engines run close.
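To make the prefix-sharing idea concrete, here is a minimal sketch of a prefix cache over token ids. It is a toy, not SGLang's implementation: it uses a plain one-token-per-node trie rather than a compressed radix tree, it stores no KV tensors, and it has no eviction. The names `PrefixCache` and `match_and_insert`, and the token ids, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One trie node per cached token; children are keyed by the next token id.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie standing in for a radix tree of cached prefixes."""

    def __init__(self):
        self.root = Node()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on every node is new, so no later
                # token in this request can count as reused.
                child = Node()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused


# Hypothetical agent traffic: a 100-token system prompt plus tool
# descriptions, shared by every request, followed by short user turns.
system_prompt = list(range(100))
cache = PrefixCache()

r1 = cache.match_and_insert(system_prompt + [1001, 1002])
r2 = cache.match_and_insert(system_prompt + [2001, 2002, 2003])
r3 = cache.match_and_insert(system_prompt + [1001, 1002, 3001])
print(r1, r2, r3)  # 0 100 102
```

The first request finds nothing cached. The second reuses the full 100-token shared prompt, and the third also reuses the two-token turn it has in common with the first. In a real engine each reused token is one whose attention state does not have to be recomputed, which is why traffic with long shared prompts benefits most.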
Full coverage at /projects/sglang.