Glossary

inference

Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.

Runtime (also: Silicon, Agents, Evaluation)

The forward pass: input goes in, weights stay fixed, output comes out. Inference is what every API call to a hosted model does and what every self-served instance does locally. The economics of inference dominate operational cost for any model deployed at scale because each request runs the full forward pass and pretraining is amortized over the model’s serving lifetime.
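The amortization argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `cost_breakdown` and every dollar and token figure are hypothetical assumptions, not numbers from this entry.

```python
def cost_breakdown(pretraining_usd, serving_usd_per_million, tokens_served):
    """Split lifetime cost into a fixed pretraining part and a per-token serving part."""
    serving_usd = serving_usd_per_million * tokens_served / 1_000_000
    total_usd = pretraining_usd + serving_usd
    return {
        "serving_usd": serving_usd,
        "serving_share": serving_usd / total_usd,
    }

# Hypothetical figures: a one-time 10M USD pretraining run, 1 USD per million tokens served.
early = cost_breakdown(10_000_000, 1.0, tokens_served=10**12)
at_scale = cost_breakdown(10_000_000, 1.0, tokens_served=10**14)
print(f"serving share early:    {early['serving_share']:.2f}")
print(f"serving share at scale: {at_scale['serving_share']:.2f}")
```

Under these assumed numbers the pretraining cost is fixed while the serving cost grows linearly with tokens served, so the serving share of total cost rises toward 1 as volume increases.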

For language models, inference has two phases. Prefill: process the prompt in parallel to fill the KV cache. Decode: generate one token at a time autoregressively. Prefill is compute-bound on long prompts; decode is memory-bandwidth-bound on long generations. Runtime engines optimize these two regimes separately.
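The two phases can be sketched with a toy single-head attention loop in pure Python. This is a minimal illustration under stated assumptions, not how any particular engine is implemented: `project`, `DIM`, `VOCAB_SIZE`, and the token-selection rule are hypothetical stand-ins for a real model's embeddings, projections, and sampling.

```python
import math

DIM = 4          # toy hidden size (assumption)
VOCAB_SIZE = 16  # toy vocabulary size (assumption)

def project(token, offset):
    """Hypothetical stand-in for an embedding followed by a learned projection."""
    return [math.sin((token + offset) * (i + 1)) for i in range(DIM)]

def attend(query, keys, values):
    """Softmax attention of one query over every cached key/value pair."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(DIM)]

def next_token(cache, token):
    """Attend from the latest token over the cache, then pick a token id."""
    out = attend(project(token, 0), cache["k"], cache["v"])
    return int(abs(sum(out)) * 1000) % VOCAB_SIZE  # toy deterministic "sampling"

def prefill(prompt):
    """Prefill: compute K/V for all prompt tokens (batched in a real engine)."""
    cache = {"k": [project(t, 1) for t in prompt],
             "v": [project(t, 2) for t in prompt]}
    return cache, next_token(cache, prompt[-1])

def decode(cache, token, max_new_tokens):
    """Decode: one token per step, appending a single K/V entry each time."""
    generated = [token]
    while len(generated) < max_new_tokens:
        cache["k"].append(project(token, 1))
        cache["v"].append(project(token, 2))
        token = next_token(cache, token)
        generated.append(token)
    return generated

prompt = [3, 1, 4]
cache, first = prefill(prompt)
tokens = decode(cache, first, max_new_tokens=5)
print(tokens)
```

Each decode step computes K/V for only the newest token but reads the entire cache, which is one way to see why long generations tend to be limited by memory traffic and not by arithmetic.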

Open-source inference engines (vLLM, SGLang, llama.cpp, MLX) compete on throughput, memory efficiency, and supported features. The choice drives cost more than the model choice does: the same model on a poorly-tuned runtime can cost ten times more per token than on a well-tuned one.
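The link between throughput and cost per token is simple division. The sketch below assumes a flat hourly GPU price and a steady tokens-per-second rate. The function name and all figures are hypothetical, chosen only to reproduce the tenfold gap described above.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Serving cost per million output tokens for one GPU at a steady rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical: same model, same 2 USD/hour GPU, two differently tuned runtimes.
poorly_tuned = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=250)
well_tuned = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=2500)
print(f"poorly tuned: {poorly_tuned:.2f} USD per million tokens")
print(f"well tuned:   {well_tuned:.2f} USD per million tokens")
```

With hardware price held constant, cost per token is inversely proportional to throughput, so a tenfold throughput difference between runtimes is a tenfold cost difference.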

Sources

vLLM: A high-throughput serving engine for LLMs
