The Open-Source AI Stack


SGLang

RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

Apache 2.0 · stable

SGLang is an open inference engine for large language models from the LMSYS team (the same group behind Chatbot Arena), released under Apache 2.0. Its central technical contribution is RadixAttention, a KV-cache sharing scheme that stores the cache in a radix tree so that requests with shared prefixes (system prompts, few-shot examples, multi-turn agent conversations) reuse computation instead of repeating it. SGLang also has strong primitives for structured generation: constraining output to a JSON schema, a grammar, or a regex.

SGLang matters as the leading vLLM alternative on the open runtime side. Compared to vLLM (PagedAttention, broader hardware support, larger ecosystem), SGLang's distinctive wins are workloads with heavy prompt-prefix sharing (agent loops, multi-turn chat with long system prompts) and structured-output requirements. llama.cpp is the local-first sibling for hardware you own; SGLang and vLLM target server-class GPUs. TensorRT-LLM is the closed NVIDIA-only counterpart.

SGLang is production-ready. It is used by xAI for Grok, by hosted-inference providers for shared-prefix workloads, and increasingly by teams running agent fleets where prefix caching pays off. The strategic position is not "displace vLLM" but "be the right engine for workloads where RadixAttention matters." vLLM and SGLang continue to converge on each other's strengths.
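To make the shared-prefix idea behind RadixAttention concrete, here is a minimal sketch in plain Python. It is not SGLang's implementation or API: the names PrefixCache, match_prefix, insert and serve are hypothetical, the tree stores one token per node (a real radix tree compresses runs of tokens into a single edge), and no actual KV tensors or eviction policy are modeled. It only counts how many leading tokens of a request could be reused from an earlier request.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cached token. A real radix tree would store a run of tokens per edge."""
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix cache (hypothetical): tracks which token prefixes were already processed."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already in the tree."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Add a token sequence to the tree, creating nodes only where needed."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


def serve(cache, tokens):
    """Return (tokens reused from cache, tokens that still need computing)."""
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "agent", "."]

# The first request pays for the whole prompt; the second reuses the system prompt.
first = serve(cache, system_prompt + ["List", "files"])
second = serve(cache, system_prompt + ["Read", "README"])
print(first)   # (0, 8)
print(second)  # (6, 2)
```

The second request only has to process its two new tokens, which is why the benefit grows with the length of the shared system prompt and the number of requests that share it.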

