llama.cpp is Georgi Gerganov's open inference engine for large language models. It is written in pure C/C++ with no required runtime dependencies and is MIT-licensed. The project also defines the GGUF file format, which has become the de facto standard for distributing quantized open-weights models. Backend support includes CPU SIMD, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, and more, with the Apple Silicon backend particularly well-tuned.

llama.cpp matters because it is the only major inference engine designed first for hardware you actually own. Where vLLM and SGLang target server-class NVIDIA GPUs, llama.cpp targets laptops, desktops, Macs, Raspberry Pis, and small-VRAM consumer cards. Combined with Apple Silicon's unified-memory bandwidth, llama.cpp is the substrate that makes "a 70B model running on your Mac at usable speed" possible.

Compared to siblings: Ollama is the polished UX layer wrapping llama.cpp, MLC-LLM is a cross-platform compilation play, and MLX is Apple's first-party ML framework. llama.cpp is the engine the others either build on or compete with directly.

llama.cpp is production-ready and the on-device standard. It is used by Ollama, LM Studio, Jan, Faraday, and many other consumer AI apps that run inference on the user's machine. It is maintained by an energetic community led by Gerganov, and the project's pace is among the fastest in open AI infrastructure.

The quiet strategic significance: llama.cpp is the layer that makes the sovereignty-anchored "buy your own hardware, run your own models" thesis operational today.
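To make the GGUF format mentioned above a little more concrete, here is a minimal Python sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the layout described in the GGUF specification for version 2 and later: a 4-byte `GGUF` magic, a little-endian 32-bit version, then 64-bit tensor and metadata key-value counts. The function name `read_gguf_header` and the numbers in the synthetic demo header are illustrative only and are not part of llama.cpp.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(stream):
    """Parse the fixed-size GGUF header from a binary stream.

    Assumes GGUF version 2 or later, little-endian, where both counts
    are unsigned 64-bit integers. Hypothetical helper for illustration.
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")

    (version,) = struct.unpack("<I", stream.read(4))
    if version < 2:
        # Assumption: earlier versions used a different count width.
        raise ValueError(f"unsupported GGUF version: {version}")

    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Demo on a synthetic in-memory header (made-up counts, not a real model).
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

For a real model file, the same function could be called on `open(path, "rb")`. The metadata key-value pairs and tensor descriptors that follow the header are variable-length and are left out of this sketch.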
The Stack · Runtime · Open source
llama.cpp
Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.
Sources
- llama.cpp on GitHub https://github.com/ggml-org/llama.cpp
- GGUF Format Specification https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
- ggml (the underlying tensor library) https://github.com/ggml-org/ggml
- Georgi Gerganov on GitHub https://github.com/ggerganov
Other projects at the Runtime layer
6 siblings · ordered open first
- vLLM Open source
Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.
- SGLang Open source
RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.
- Ollama Open source
Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.
- Text Generation Inference (TGI) Open source
Hugging Face's production inference server; in maintenance mode in 2026 as vLLM became the standard.
- MLC-LLM Open source
Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.
- TensorRT-LLM Source available
NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.