Glossary

NF4

A 4-bit normal-float quantization format from the QLoRA paper. The 16 quantization levels are spaced to match the roughly normal distribution of pretrained weights.

Category: Weights; also: Training, Runtime. Also known as: nf4, 4-bit normal float, normalfloat4

A 4-bit weight quantization format introduced in the QLoRA paper by Tim Dettmers and collaborators in May 2023. The design observation: pretrained transformer weights follow a roughly normal distribution. Standard 4-bit quantization spaces its 16 levels uniformly from -1 to 1, which gives the sparsely populated tails the same resolution as the densely populated middle, where most weights sit. NF4 instead places its 16 levels at quantiles of a standard normal distribution, rescaled to the range -1 to 1, so the level density matches the weight density.
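As a rough illustration of the quantile idea, the Python sketch below builds a 16-level codebook from standard-normal quantiles and round-trips a toy block of weights through it. It is a simplification, not the exact NF4 codebook: the QLoRA construction is asymmetric so that zero is represented exactly, while this version simply takes 16 midpoint quantiles and rescales them to [-1, 1]. The function names and the toy block are hypothetical; the per-block absmax scaling mirrors the general block-wise scheme described in the QLoRA paper.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Simplified normal-quantile codebook rescaled to [-1, 1] (not exact NF4)."""
    n = 2 ** bits
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Midpoint probabilities avoid the infinite quantiles at p = 0 and p = 1.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(block, levels):
    """Scale a block by its absolute maximum, then store nearest-level indices."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()

# Spacing between adjacent levels: tight near zero, wide at the tails.
inner_gap = levels[8] - levels[7]
outer_gap = levels[15] - levels[14]
print(f"gap near zero: {inner_gap:.3f}, gap at the tail: {outer_gap:.3f}")

block = [0.02, -0.5, 0.13, 0.5]  # toy weights, purely illustrative
codes, absmax = quantize_block(block, levels)
print(codes, dequantize_block(codes, absmax, levels))
```

In this sketch the gap between the two central levels is noticeably smaller than the gap between the two outermost levels, whereas a uniform grid would use the same gap of 2/15 everywhere.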

NF4 is one of the two 4-bit data types in bitsandbytes (the other is FP4), the library that powers most 4-bit fine-tuning workflows in Hugging Face’s transformers. It is not the library default: in transformers it is selected explicitly with bnb_4bit_quant_type="nf4". The format pairs naturally with LoRA adapters: load the frozen base weights in NF4, train small high-precision LoRA adapters on top, and the combined memory footprint stays a small fraction of what full 16-bit fine-tuning requires. QLoRA used this combination to fine-tune Llama 65B on a single 48 GB A6000.
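A back-of-the-envelope calculation suggests why the 65B result fits on one card. The sketch below counts weight storage only and assumes a nominal 65e9 parameters, blocks of 64 weights, and one 32-bit scaling constant per block (the block-wise setup described in the QLoRA paper, before its double-quantization step). Adapters, activations, and optimizer state are not counted, and the helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight, block_size=None, scale_bits=32):
    """Approximate weight storage in GB (10**9 bytes)."""
    total_bits = n_params * bits_per_weight
    if block_size is not None:
        # One scaling constant is stored per block of weights.
        total_bits += (n_params / block_size) * scale_bits
    return total_bits / 8 / 1e9


n_params = 65e9  # nominal parameter count, assumed for illustration
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4, block_size=64)

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit with per-block scales: {nf4_gb:.1f} GB")
# prints: 16-bit: 130.0 GB, 4-bit with per-block scales: 36.6 GB
```

Under these assumptions the 4-bit base weights leave roughly 11 GB of a 48 GB card for everything else, which is why the adapters have to stay small.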

Production serving engines (vLLM, SGLang, TensorRT-LLM) typically prefer GPTQ or AWQ over NF4 for inference because their kernels for those formats are more heavily optimized; NF4’s natural habitat is fine-tuning and research workflows.

Sources

QLoRA paper (Dettmers et al., 2023)

Mentioned in

GPTQ
