The Open-Source AI Stack
RSS

Glossary

FP4

A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.

Silicon also: Runtime also: Weights aka fp4, 4-bit floating point, mxfp4, nvfp4

A 4-bit floating-point number, half the width of FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry and one quarter the width of FP16siliconA 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads. Open full entry . Two variants matter in practice. MXFP4 is defined by the Open Compute Project’s Microscaling Formats spec: E2M1 elements (2 exponent bits, 1 mantissa bit) with a shared block-level scale, hardware-neutral. NVFP4 is NVIDIA’s variant of the same E2M1 element with a different block scaling scheme, targeting the Blackwell generation’s native FP4 tensor cores.

The motivation is the same as every other precision drop. FP16 weights cost 2 bytes per parameter; FP4 costs 0.5. A 70B-parameter model goes from 140 GB at FP16 to roughly 35 GB at FP4, fitting on a single 48 GB card. Decode bandwidth (the binding constraint on decoderuntimeThe second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute. Open full entry throughput) drops by a factor of 4 versus FP16 because each token must stream a quarter as much weight data.

FP4 in 2026 is mostly an inference format. NVIDIA’s B200 and B300 parts have native FP4 tensor cores; the DGX Spark developer appliance also supports NVFP4. Pretrained models are typically quantized to FP4 after training with calibration to limit quality loss. FP4 training is an active research area but not standard practice; the precision is low enough that gradient noise compounds quickly. Models served at FP4 commonly use FP8 or BF16siliconA 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16. Open full entry for sensitive layers (attention scores, layer norm parameters) while keeping the bulk of the weight matrix at FP4.

Sources

Back to glossary