Glossary

TensorRT-LLM

NVIDIA's inference engine for NVIDIA GPUs: usually the fastest runtime on Hopper and Blackwell, but tied to CUDA and NVIDIA's proprietary kernel stack.

Category: Runtime (also: Silicon). Also known as: tensorrt llm

NVIDIA's production inference engine, the closed-runtime counterpart to vLLM and SGLang. TensorRT-LLM ships hand-tuned kernels for the latest NVIDIA architectures (Hopper, Blackwell) and integrates with the broader TensorRT compiler stack. On NVIDIA hardware it is usually the throughput leader, often by 10 to 30 percent over the best open alternatives.
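To make the 10 to 30 percent figure concrete, here is a minimal sketch of how a throughput lead translates into serving cost per token. The GPU price and baseline throughput are hypothetical values chosen only for illustration, not measurements of any runtime, and the helper `cost_per_million_tokens` is an invented name.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million output tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs (assumptions, not benchmark results).
GPU_HOUR_USD = 3.00              # assumed hourly price of one GPU
BASELINE_TOKENS_PER_S = 2000.0   # assumed throughput of an open runtime

baseline_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_S)

# Compare the baseline against runtimes that are 10% and 30% faster.
for lead in (0.10, 0.30):
    faster_cost = cost_per_million_tokens(
        GPU_HOUR_USD, BASELINE_TOKENS_PER_S * (1 + lead)
    )
    saving = 1 - faster_cost / baseline_cost
    print(f"{lead:.0%} throughput lead -> {saving:.1%} lower cost per token")
```

Because cost per token scales with the inverse of throughput, the saving is 1 - 1/(1 + lead), which is somewhat smaller than the throughput lead itself: about 9 percent for a 10 percent lead and about 23 percent for a 30 percent lead.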

The trade-off is openness. TensorRT-LLM is NVIDIA-only by design and depends on closed CUDA kernels, so a deployment built on TensorRT-LLM is locked to NVIDIA hardware. The wrapper itself is permissively licensed; the performance comes from kernels that are not.

Sources

TensorRT-LLM documentation
