Glossary

llama.cpp

Georgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products.

Runtime also: Sovereignty and Decentralization Primitives aka llama-cpp, ggml

The reference local-firstsovereignty-decentralizationAn architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation. Open full entry LLM runtime. LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry .cpp started as a weekend port of the original LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry model to C++ with hand-tuned ARM and AVX kernels; by 2024 it covered virtually every modern open weightsweightsA model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness. Open full entry model, nearly every quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry scheme via the GGUF format, GPUsiliconA massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks. Open full entry offload via Metal, CUDAsiliconNVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s. Open full entry , ROCmsiliconAMD's open-source GPU compute stack, the main credible alternative to CUDA, with growing coverage in PyTorch and vLLM but still trailing on kernel maturity and tooling. Open full entry , and Vulkan, and an HTTP server compatible with the OpenAI API.

OllamaruntimeA local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine. Open full entry , LM StudioruntimeA desktop application for running open-weight models locally with a GUI, model browser, and OpenAI-compatible local server, targeting users who prefer apps over command-line tools. Open full entry , GPT4All, Jan, and most other consumer-facing local runtimes are llama.cpp under different UIs. The project is permissively licensed and community-maintained.

Full coverage at /projects/llama-cpp.

Sources

llama.cpp GitHub

Mentioned in

Back to glossary