Glossary

SGLang

An open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads.

Category: Runtime (also: Agents)

An open inference engine, competitive with its peers, launched in 2024 by the LMSYS team (the same group behind Chatbot Arena and Vicuna). Its defining technical features are RadixAttention, which shares cached prefixes across requests, and a Python frontend for structured generation (constrained decoding, parallel forking, and control flow over multiple generations).

The combination suits agent-shaped traffic: many requests reuse the same long system prompts and tool descriptions, so cached state can be shared across them. On agent benchmarks SGLang typically matches or beats vLLM on throughput; on pure single-turn chat the two engines run close.
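To make the prefix-sharing idea concrete, here is a minimal sketch of a token-level prefix tree that counts how many leading tokens of each request were already seen. It is an illustration only, not SGLang's implementation or API: the names `PrefixCache` and `match_and_insert` are hypothetical, tokens are whitespace-split words rather than real tokenizer output, and the tree stores no KV tensors and does no eviction. A real radix tree would also compress runs of tokens into single edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One node per token (hypothetical simplification: no edge compression).
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy stand-in for a prefix-sharing cache; not SGLang's API."""

    def __init__(self):
        self.root = Node()

    def match_and_insert(self, tokens):
        """Return the number of leading tokens already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node.children:
                # Still walking a previously cached prefix.
                reused += 1
            else:
                # First miss: every later token lands in a fresh subtree.
                node.children[tok] = Node()
            node = node.children[tok]
        return reused


# Three agent-style requests sharing one system prompt (made-up text).
system = "You are an agent . Tools : search , calc .".split()
questions = ["find cheap flights", "find cheap hotels", "add 2 and 2"]
requests = [system + q.split() for q in questions]

cache = PrefixCache()
reused = [cache.match_and_insert(r) for r in requests]
total = sum(len(r) for r in requests)
print(f"reused {sum(reused)} of {total} tokens: {reused}")
```

The first request finds nothing cached; later ones skip the shared system prompt and any further common prefix, which is where the benefit on agent-shaped traffic comes from under these assumptions.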

Full coverage at /projects/sglang.

Sources

SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2024)
