Glossary
inference
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
The forward pass: input goes in, weights stay fixed, output comes out. Inference is what every API call to a hosted model does and what every self-served instance does locally. The economics of inference dominate operational cost for any model deployed at scale because each request runs the full forward pass, while the one-time cost of pretraining is amortized over the model’s serving lifetime.
For language models, inference has two phases. Prefill: process the
prompt in parallel to fill the KV cache. Decode: generate one token at
a time autoregressively. Prefill is compute-bound on long prompts;
decode is memory-bandwidth-bound on long generations. Runtime engines
optimize these two regimes separately.
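The compute-bound versus memory-bandwidth-bound split can be illustrated with a rough roofline estimate. The sketch below is a simplification, not a model of any particular engine: it assumes about 2 FLOPs per parameter per token, assumes every weight is read from memory once per forward pass, and ignores KV cache and activation traffic. The hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical round numbers chosen for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of every weight
    per pass; KV cache and activation traffic are ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def regime(tokens_per_pass: int, bytes_per_param: float,
           peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a forward pass against the hardware's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 2.0e12
BYTES_PER_PARAM = 2.0  # 16-bit weights

# Prefill: a 2048-token prompt is processed in a single parallel pass.
prefill = regime(2048, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)

# Decode: one new token per pass for a single request.
decode = regime(1, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)

print("prefill:", prefill)
print("decode: ", decode)
```

Under these assumptions, the weights are read once whether the pass covers one token or thousands, so the work done per byte read scales with the number of tokens in the pass. Batching many decode requests together raises `tokens_per_pass` in the same way.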
Open-source inference engines (vLLM, SGLang, llama.cpp, MLX) compete
on throughput, memory efficiency, and supported features. The choice
drives cost more than the model choice does: the same model on a
poorly tuned runtime can cost ten times more per token than on a
well-tuned one.
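The link between throughput and cost per token is simple arithmetic. The sketch below uses hypothetical numbers: the hourly GPU price and both throughput figures are assumptions chosen to show a tenfold gap, not measurements of any engine.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million output tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures: same model, same GPU, two differently tuned runtimes.
GPU_PRICE_PER_HOUR = 2.00      # dollars, assumed
POORLY_TUNED_TPS = 250.0       # tokens per second, assumed
WELL_TUNED_TPS = 2500.0        # tokens per second, assumed

poor = cost_per_million_tokens(GPU_PRICE_PER_HOUR, POORLY_TUNED_TPS)
well = cost_per_million_tokens(GPU_PRICE_PER_HOUR, WELL_TUNED_TPS)

print(f"poorly tuned: ${poor:.2f} per million tokens")
print(f"well tuned:   ${well:.2f} per million tokens")
print(f"ratio:        {poor / well:.1f}x")
```

At a fixed hardware price, cost per token is inversely proportional to throughput, so a tenfold throughput gap becomes a tenfold cost gap.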
Mentioned in
- attention
- attestation
- batching
- Cerebras
- layer Compute
- confidential computing
- continuous batching
- CUDA
- decentralized GPU marketplace
- DeepSpeed
- FlashAttention
- FP8
- GPU
- Groq
- HBM
- Hugging Face
- layer Infrastructure
- knowledge distillation
- KV cache
- local-first
- LoRA
- Mixtral
- mixture of experts
- MLX
- NVLink
- Ollama
- on-device
- PagedAttention
- PEFT
- Petals
- quantization
- ROCm
- scheduler
- SGLang
- spot instance
- TEE
- TensorRT-LLM
- TGI
- tokenizer
- TPU
- verifiable inference
- vLLM
- ZKML