03 Quantization formats

self-host

GGUF, GPTQ, AWQ, NF4, EXL2, EXL3, FP8, FP4, MLX, ONNX. None are interchangeable.

quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry compresses model weights from their native FP16siliconA 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads. Open full entry or BF16siliconA 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16. Open full entry training precision down to lower-bit formats so they take less memory and stream faster from HBM or unified memory to compute units. The arithmetic is the same idea every time (map a continuous weight range to a discrete grid of values, store the grid index plus a scale), but the file formats and the kernels that consume them are not interchangeable. A model quantized for one engine usually cannot be loaded by another without conversion, and the conversion may not be exact.

GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry is the llama.cpp container. It packages weights, tokenizer, prompt template, and metadata into one file, then layers a family of quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry schemes on top: GPTQweightsA post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data. Open full entry , k-quants with different block structures, and newer i-quants that improve quality at very low bit counts. GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry is the dominant on-device format. Ollama, LM Studio, GPT4All, and most consumer local runtimes consume it directly. Hugging Face hosts thousands of community-quantized GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry variants of official safetensors releases.

GPTQweightsA post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data. Open full entry is a layer-wise post-training weight quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry scheme designed for GPU inference (Frantar et al., 2022). It uses approximate second-order information to choose quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry values that minimize per-layer reconstruction error. GPTQ files are typically 4-bit with optional grouping, and the format is consumed by ExLlamaV2, AutoGPTQ, and vLLM’s GPTQ kernels. The format predates AWQ but remains widely used for consumer-GPU 4-bit serving.

AWQweightsA post-training quantization method that protects the small fraction of weight channels that handle the largest activations, achieving 4-bit weights with little quality loss. Open full entry (Activation-aware Weight Quantization, Lin et al., 2023) takes a different view: instead of treating all weights equally, it identifies the small fraction of weight channels that disproportionately affect activations and protects them from aggressive quantization. The result is better quality at the same bit count for many models, especially instruction-tuned variants. AWQ files are consumed by vLLM, TensorRT-LLM, and the AutoAWQ toolchain.

NF4 is the 4-bit normal float used by the bitsandbytes library. It is the format behind QLoRA fine-tuning and many Hugging Face Transformers quantized loads. NF4 maps weight values to a non-uniform grid chosen to fit the normal distribution that pretrained weights tend to follow, which improves quality versus uniform 4-bit at minimal cost. It is most commonly seen during training-time loads rather than production serving.

GPTQweightsA post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data. Open full entry are ExLlamaV2 and ExLlamaV3’s native formats, optimized for consumer NVIDIA GPUs with single-stream decoderuntimeThe second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute. Open full entry in mind. EXL2 supports mixed-precision quantization where different layers run at different bit counts within the same model, letting authors tune for a target file size. EXL3 extends this with newer kernel paths and tighter integration with FlashAttentionruntimeAn exact attention algorithm that reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving 2 to 4 times speedup with no quality loss. Open full entry . Both formats are the typical choice for single-RTX-card workstations running 30B to 70B models.

FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry and FP4siliconA 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization. Open full entry are native hardware formats on H100 and B200 class GPUs. Unlike post-training quantization formats above, FP8 and FP4 are floating-point types with their own exponent and mantissa bits, and the hardware multiplies them natively. Models trained in FP8 (DeepSeek-V3 is one) avoid the post-training quantization step entirely for these cards. FP4 is newer and tied to Blackwell-generation accelerators.

MLX format is Apple’s quantization scheme for unified memory hardware. The MLX framework uses its own weight layout optimized for the M-series chips’ shared memory model and supports 4-bit, 8-bit, and mixed-precision variants. MLX files are not interchangeable with GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry even at the same nominal bit count.

ONNX is a portable serialization format with optional quantization support, used by ONNX Runtime and OpenVINO for cross-platform deployment. ONNX favors compatibility over peak performance; it is the format of choice when the same model needs to ship to multiple hardware targets (NVIDIA, AMD, Intel, ARM) without re-training.

The implication for self-hosting: “this 13B model fits in 6 GB” claims are runtime-specific. The same nominal weight precision can have very different memory footprints in different engines because of how each one keeps dequantized scratch buffers, KV cache representation (often quantized separately), and intermediate activations. A Q4_K_M file in llama.cpp at 8K context uses meaningfully more RAM than the same file at 2K context, and meaningfully less than the equivalent AWQ file in vLLM at the same context because vLLM keeps a separate PagedAttentionruntimeAn attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently. Open full entry in fixed blocks. Always measure the actual configuration you intend to run.