Glossary

TGI

Hugging Face's production inference server, an early peer of vLLM that ceded throughput leadership in 2024 and now sits in maintenance mode behind vLLM and SGLang.

Category: Runtime. Also known as Text Generation Inference.

Hugging Face’s open inference server, released in 2022 as one of the first production-grade serving stacks for open-weights LLMs. TGI shipped Rust-based request handling, Python-based model code, and CUDA kernels for transformer attention and continuous batching.
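Continuous batching means the engine admits a waiting request into the running batch as soon as a slot frees up, instead of waiting for the whole batch to complete. The toy simulation below is a minimal sketch of that scheduling idea, not TGI's actual scheduler: the function names, the one-token-per-step model, and the sample request lengths are all illustrative assumptions, and it ignores prefill cost and memory limits.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The batch runs as long as its longest request.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step.

    Assumes every request needs at least one token (all lengths > 0).
    """
    waiting = deque(lengths)
    running = []  # tokens still to generate, one entry per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths in tokens, mixed short and long.
lengths = [2, 8, 2, 2, 8, 2]
static_steps = static_batching(lengths, batch_size=2)
continuous_steps = continuous_batching(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With this workload and two slots, the static schedule takes 18 decode steps and the continuous one takes 14, because short requests no longer hold a slot idle while a long neighbor finishes.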

Through 2023 TGI was the credible alternative to NVIDIA’s TensorRT-LLM. The release of vLLM and the rapid maturation of SGLang in 2024 narrowed the niche; by 2026 TGI is in maintenance mode within the Hugging Face ecosystem, used mostly for legacy deployments and on Hugging Face’s own Inference Endpoints product.

Sources

Text Generation Inference
