The Open-Source AI Stack
RSS

Glossary

AWQ

A post-training quantization method that protects the small fraction of weight channels that handle the largest activations, achieving 4-bit weights with little quality loss.

Weights also: Runtime aka awq, activation-aware weight quantization

A post-training weight quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry method introduced by MIT researchers in mid-2023. The observation behind AWQ: not all weight channels matter equally. A small fraction (roughly 0.1 to 1 percent) of channels see the largest activation magnitudes and therefore dominate model output; quantizing them aggressively breaks the model. AWQ identifies these “salient” channels using calibration data and keeps them at higher precision while quantizing the rest to 4 bits.

In practice, AWQ produces 4-bit weights with smaller quality degradation than naive 4-bit quantization, comparable to or better than GPTQweightsA post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data. Open full entry on many benchmarks. The mechanism is different from GPTQ’s layer-wise reconstruction but the outcome is similar: aggressive memory savings (~4x) with small accuracy cost.

vLLM, SGLang, TensorRT-LLM, and most production serving engines support AWQ alongside GPTQ. Which format performs better on a given deployment depends on the kernel implementations the engine has for each, not on the underlying math; on H100s with optimized AWQ kernels, AWQ often edges out GPTQ on throughput. The two are practical alternatives and most quantization workflows publish both.

Sources

Mentioned in

Back to glossary