Glossary

quantization

Storing or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss.

Weights also: Runtime also: Silicon

The set of techniques for representing model weights and activations in fewer bits than the training precision. A 70B parameter model in FP16 is 140 GB; in INT4 it is 35 GB; in many low-bit formats it runs on a single consumer GPUsiliconA massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks. Open full entry . The cost is some quality loss, which modern post-trainingtrainingEverything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant. Open full entry quantization (GPTQ, AWQ, GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry /llama.cppruntimeGeorgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products. Open full entry ’s k-quant formats) has reduced to a fraction of a percent on most benchmarks for 4-bit and barely measurable for 8-bit.

The format choice depends on deployment. FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry is the data-center default on Hopper/Blackwell and MI300X, supported by mature kernels and minimal quality loss. INT4 (GPTQ, AWQ) is common for inferenceruntimeRunning a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training. Open full entry on older hardware without FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry support. GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry k-quants (Q4_K_M, Q5_K_M, etc.) target on-devicesovereignty-decentralizationRunning model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, FP8 quantization, and runtimes like llama.cpp and MLX. Open full entry deployment via LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry .cpp and balance speed and quality.

Training in low precision is harder than inferenceruntimeRunning a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training. Open full entry . FP8 mixed- precision training has become production-ready on Hopper; 4-bit training and 2-bit training remain research. Inference quantization is mature; training quantization is still an active frontierweightsThe current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold. Open full entry in 2026.

Sources

Mentioned in

Back to glossary