Glossary
FP8
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
An 8-bit floating-point number representation with two common variants: E4M3 (4 exponent bits, 3 mantissa bits; more precision, less range) and E5M2 (5 exponent bits, 2 mantissa bits; more range, less precision). FP8 emerged as a practical inference precision on NVIDIA Hopper (via the Transformer Engine) and is now common across Hopper, Blackwell, MI300X, and other 2024+ accelerators.
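The range-versus-precision trade-off between the two variants can be made concrete by decoding every 8-bit pattern. The following is a minimal sketch, not a reference implementation. It assumes the widely used conventions for these formats, none of which are stated in this entry: an exponent bias of 7 for E4M3 and 15 for E5M2, IEEE-style infinities and NaNs for E5M2, and a single NaN mantissa pattern with no infinities for E4M3. The names `decode_fp8`, `finite_values`, `E4M3`, and `E5M2` are illustrative.

```python
import math

# Assumed format parameters (common convention, not taken from this entry).
E4M3 = {"exp_bits": 4, "man_bits": 3, "bias": 7, "ieee_specials": False}
E5M2 = {"exp_bits": 5, "man_bits": 2, "bias": 15, "ieee_specials": True}


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int,
               ieee_specials: bool) -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_specials:
        # E5M2 assumption: all-ones exponent encodes infinity or NaN.
        if exp == exp_max:
            return sign * math.inf if man == 0 else math.nan
    elif exp == exp_max and man == man_max:
        # E4M3 assumption: only the all-ones pattern is NaN, no infinities.
        return math.nan

    fraction = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exp - bias)


def finite_values(fmt: dict) -> list:
    """All finite values representable in the given format."""
    decoded = (decode_fp8(b, **fmt) for b in range(256))
    return [v for v in decoded if math.isfinite(v)]


if __name__ == "__main__":
    for name, fmt in (("E4M3", E4M3), ("E5M2", E5M2)):
        values = finite_values(fmt)
        # Spacing between 1.0 and the next value shows the precision.
        step = min(v for v in values if v > 1.0) - 1.0
        print(name, "max:", max(values), "step above 1.0:", step)
```

Under these assumptions E4M3 tops out at 448 with a step of 0.125 just above 1.0, while E5M2 reaches 57344 with a coarser step of 0.25, which is the trade-off described above.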
The motivation is straightforward. FP16 weights cost 2 bytes each; FP8 weights cost 1 byte. A 70B-parameter model goes from 140 GB to 70 GB of weights, small enough to fit on a single 80 GB H100, though the remaining 10 GB must also hold the KV cache and activations. Memory bandwidth pressure on attention and matmul halves at the same time, improving throughput.
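The arithmetic above can be checked with a short sketch. It counts weights only, uses decimal gigabytes (10^9 bytes), and ignores per-tensor scale factors, KV cache, and activations; the function and constant names are illustrative.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in decimal GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


PARAMS_70B = 70e9      # 70B-parameter model from the text
GPU_MEMORY_GB = 80     # single 80 GB H100 from the text

fp16_gb = weight_footprint_gb(PARAMS_70B, 2)  # 2 bytes per FP16 weight
fp8_gb = weight_footprint_gb(PARAMS_70B, 1)   # 1 byte per FP8 weight
headroom_gb = GPU_MEMORY_GB - fp8_gb          # left for KV cache, activations

if __name__ == "__main__":
    print(f"FP16 weights: {fp16_gb:.0f} GB")
    print(f"FP8 weights:  {fp8_gb:.0f} GB")
    print(f"Headroom on one 80 GB GPU: {headroom_gb:.0f} GB")
```

The same halving applies to the bytes read from memory per weight during matmul, which is where the bandwidth saving comes from.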
Training in FP8 is harder than inference: the lower precision makes gradient updates noisier and stability touchier. DeepSeek-V3 reported production FP8 training, and several open follow-ons have replicated the technique. The default in 2026 is FP8 for inference and mixed-precision FP8/BF16 for training, with full precision rare.