SGLang is an open-source inference engine for large language models from the LMSYS team (the same group behind Chatbot Arena), released under the Apache 2.0 license. Its central technical contribution is RadixAttention, a KV-cache sharing scheme that stores the cache in a radix tree so that requests with shared prefixes (system prompts, few-shot examples, multi-turn agent conversations) reuse computation instead of repeating it. SGLang also has strong primitives for structured generation: constraining output to a JSON schema, a grammar, or a regex.

SGLang matters as the leading vLLM alternative among open runtimes. Compared to vLLM (PagedAttention, broader hardware support, larger ecosystem), SGLang's distinctive wins come in workloads with heavy prompt-prefix sharing (agent loops, multi-turn chat with long system prompts) and in workloads with structured-output requirements. llama.cpp is the local-first sibling for hardware you own, while SGLang and vLLM target server-class GPUs. TensorRT-LLM is the closed NVIDIA-only counterpart.

SGLang is production-ready. It is used by xAI for Grok, by hosted-inference providers for shared-prefix workloads, and increasingly by teams running agent fleets where prefix caching pays off. The strategic position is not "displace vLLM" but "be the right engine for workloads where RadixAttention matters." vLLM and SGLang continue to converge on each other's strengths.
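
The prefix reuse that RadixAttention is built around can be illustrated with a toy model. The sketch below is not SGLang's implementation: it uses a plain per-token trie instead of a compressed radix tree, it tracks only token counts and not actual KV tensors, and the names `Node`, `PrefixCache`, `match_prefix`, and `process` are hypothetical. It only shows why a second request that shares a long system prompt needs far less new computation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token id. In a real engine each node would also
    # reference the KV-cache blocks for its tokens (omitted in this toy).
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token trie that records which prefixes have been seen."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record every token of the sequence along one path in the trie."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())

    def process(self, tokens):
        """Simulate one request; return (reused, newly_computed) token counts."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


# Assumption: token ids 0..99 stand in for a 100-token shared system prompt.
system_prompt = list(range(100))
cache = PrefixCache()
first = cache.process(system_prompt + [1001, 1002])         # cold cache
second = cache.process(system_prompt + [2001, 2002, 2003])  # shares the prompt
print(first, second)
```

In this toy run the first request computes all 102 tokens, while the second reuses the 100 shared prompt tokens and computes only its 3 new ones. A production cache also has to handle eviction and memory limits, which this sketch leaves out.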
The Stack · Runtime · Open source
SGLang
RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.
Sources
- SGLang Project https://sglang.ai/
- SGLang on GitHub https://github.com/sgl-project/sglang
- SGLang Paper (Efficient Execution of Structured LM Programs) https://arxiv.org/abs/2312.07104
Other projects at the Runtime layer
6 siblings · ordered open first
- vLLM Open source
Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.
- llama.cpp Open source
Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.
- Ollama Open source
Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.
- Text Generation Inference (TGI) Open source
Hugging Face's production inference server; in maintenance mode in 2026 as vLLM became the standard.
- MLC-LLM Open source
Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.
- TensorRT-LLM Source available
NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.