The Open-Source AI Stack

GPTQ

A post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data.

Category: Weights (also: Runtime). Also known as: gptq, generative pre-trained transformer quantization

A one-shot post-training weight quantization method for transformers, introduced by Frantar et al. in late 2022. GPTQ quantizes one layer at a time, using a small calibration dataset and approximate second-order information to choose quantized weights that minimize the reconstruction error of that layer’s output. The result is 3-bit or 4-bit weights with roughly the same quality as naive 8-bit quantization; at 4 bits, that is about half the memory footprint of 8-bit weights and a quarter of FP16, before the small overhead of stored scales.
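The layer-wise idea can be illustrated with a toy sketch. The code below is not the GPTQ implementation: it handles a single weight row with a fixed, hand-picked scale, and it omits the column blocking, Cholesky reformulation, and grouped scales that the real method uses. All function names are hypothetical. It compares plain round-to-nearest against a greedy pass that, after rounding each weight, shifts the not-yet-quantized weights to absorb the error, guided by the inverse of a damped Hessian built from calibration inputs.

```python
def quantize_value(w, scale, bits=4):
    """Round w to the nearest point on a signed integer grid, then rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(qmin, min(qmax, round(w / scale))) * scale


def damped_hessian(samples, damp=0.01):
    """H = sum of x x^T over calibration inputs, plus a small diagonal damping."""
    n = len(samples[0])
    h = [[sum(s[i] * s[j] for s in samples) for j in range(n)] for i in range(n)]
    mean_diag = sum(h[i][i] for i in range(n)) / n
    for i in range(n):
        h[i][i] += damp * mean_diag
    return h


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_nearest_row(weights, scale, bits=4):
    """Naive baseline: quantize every weight independently."""
    return [quantize_value(w, scale, bits) for w in weights]


def compensated_quantize_row(weights, hessian_inv, scale, bits=4):
    """Quantize left to right, pushing each rounding error onto later weights."""
    w = list(weights)
    hinv = [row[:] for row in hessian_inv]
    n = len(w)
    quantized = [0.0] * n
    for q in range(n):
        quantized[q] = quantize_value(w[q], scale, bits)
        err = (w[q] - quantized[q]) / hinv[q][q]
        for j in range(q + 1, n):
            w[j] -= err * hinv[j][q]  # compensate the remaining weights
        d = hinv[q][q]
        for i in range(q + 1, n):     # drop coordinate q from the inverse Hessian
            for j in range(q + 1, n):
                hinv[i][j] -= hinv[i][q] * hinv[q][j] / d
    return quantized


def output_error(weights, quantized, samples):
    """Squared error of the layer output over the calibration inputs."""
    return sum(sum((a - b) * x for a, b, x in zip(weights, quantized, s)) ** 2
               for s in samples)


# Toy calibration set: the two inputs are usually equal, so their errors can cancel.
samples = [(1, 1), (1, 1), (1, 1), (1, -1)]
weights = [0.4, 0.4]
scale = 1.0  # deliberately coarse grid so the effect is visible

rtn = round_to_nearest_row(weights, scale)
comp = compensated_quantize_row(weights, invert(damped_hessian(samples)), scale)
print(rtn, output_error(weights, rtn, samples))
print(comp, output_error(weights, comp, samples))
```

On this toy input, round-to-nearest sends both weights to 0 and loses the whole output, while the compensated pass rounds the second weight up to 1 to offset the first, cutting the calibration error from roughly 1.92 to roughly 1.12. The compensated result is not guaranteed to beat the baseline on every input; the point is only that the objective is the layer's output error, not the per-weight rounding error.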

GPTQ is one of three quantization formats that dominate the open-weights ecosystem for GPU inference, alongside AWQ and the bitsandbytes NF4 format. Each engine has its preferred format: vLLM and SGLang support GPTQ and AWQ natively with optimized kernels; ExLlamaV2 and ExLlamaV3 use their own EXL2 and EXL3 formats, which build on similar layer-wise optimization ideas with additional tricks. GGUF, the container format used by llama.cpp, relies on different block-wise schemes that lay out quantization scales differently.

The practical implication: weights that have been GPTQ-quantized are not directly portable to non-GPTQ engines. To switch engines, you either pick a format the new engine has optimized kernels for, or you re-quantize from the FP16/BF16 source weights. This is part of why the format zoo (GPTQ, AWQ, NF4, EXL2, EXL3, GGUF variants, FP8, FP4, MLX formats, ONNX) reads as messy: there’s no universal representation, only ecosystem-specific ones.
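The decision described above can be written down as a small lookup. This is a sketch only: the table lists just the engine and format pairings named in this entry, it is not a complete or current compatibility matrix, and `NATIVE_FORMATS` and `migration_plan` are hypothetical names.

```python
# Pairings taken from this glossary entry only; real engines support more formats.
NATIVE_FORMATS = {
    "vllm": {"gptq", "awq"},
    "sglang": {"gptq", "awq"},
    "exllamav2": {"exl2"},
    "exllamav3": {"exl3"},
    "llama.cpp": {"gguf"},
}


def migration_plan(current_format, target_engine):
    """Say whether existing quantized weights can be reused on the target engine."""
    supported = NATIVE_FORMATS.get(target_engine.lower())
    if supported is None:
        raise KeyError(f"unknown engine: {target_engine}")
    if current_format.lower() in supported:
        return "reuse existing weights"
    # No conversion between quantized formats: go back to the source weights.
    return "re-quantize from FP16/BF16 source into one of: " + ", ".join(sorted(supported))


print(migration_plan("GPTQ", "vLLM"))
print(migration_plan("GPTQ", "llama.cpp"))
```

The fallback branch always returns to the full-precision source because, as noted above, there is no universal representation to convert through.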
