Glossary

GGUF

A binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023.

Category: Weights (also: Runtime)

A single-file container for quantized models, and the successor to the older llama.cpp formats (GGML and its variants). A GGUF file packages the weights, tokenizer, prompt template, and metadata together, so a user can download one file and run it without further configuration.
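As an illustration of the single-file layout, the sketch below reads the fixed-size header at the start of a GGUF file: the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key-value counts, all little-endian. That layout applies to GGUF versions 2 and 3. The function name `read_gguf_header` and the sample counts are hypothetical, and the "file" is a synthetic in-memory buffer rather than a real model.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(stream):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF v2/v3 stream.

    Hypothetical helper for illustration; it reads only the fixed header,
    not the metadata key-value pairs or tensor descriptors that follow it.
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    if version < 2:
        # Version 1 stored the counts as 32-bit integers; not handled here.
        raise ValueError("GGUF version 1 is not handled in this sketch")
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return version, tensor_count, kv_count


# Synthetic header with made-up counts, standing in for a real file.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)  # (3, 291, 24)
```

With a real model, the same function could be called on `open(path, "rb")`. The metadata section that follows the header is where the tokenizer vocabulary and prompt template are stored.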

GGUF is the format of choice for on-device deployment. llama.cpp, Ollama, LM Studio, GPT4All, and most other consumer-facing local runtimes consume it. Hugging Face’s model hub now lists thousands of GGUF-format models, often community-quantized versions of the official safetensors releases.

The format itself is unopinionated about the quantization scheme. A typical Llama-3-70B might exist on the hub in a dozen quantization variants (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.), each a trade-off between file size and output quality. The community has converged on Q4_K_M as the default for serious work, with Q5_K_M or Q6_K preferred for quality-sensitive tasks.
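The size side of that trade-off is simple arithmetic: file size is roughly parameter count times bits per weight, divided by eight. The sketch below applies it to a 70-billion-parameter model. The bits-per-weight figures are rough assumptions for illustration (the mixed "K" variants apply different precisions to different tensors, so real files vary), and the estimate ignores metadata and tokenizer overhead.

```python
# Approximate bits per weight for a few variants. These are illustrative
# assumptions, not values read from any specific GGUF file.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def estimate_size_gb(n_params, bits_per_weight):
    """Estimate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # a 70B-parameter model, as in the example above

for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(N_PARAMS, bpw):.0f} GB")
```

Under these assumptions the Q4_K_M file is a little over half the size of the Q8_0 file, which is one reason it is a common default on memory-constrained machines.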

Sources

GGUF specification (llama.cpp repository)
