The Open-Source AI Stack


llama.cpp

Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.

llama.cpp is Georgi Gerganov's open inference engine for large language models. It is written in pure C/C++ with no required runtime dependencies and is MIT-licensed. The project also defines the GGUF file format, which has become the de facto standard for distributing quantized open-weights models. Backend support includes CPU SIMD, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, and more, with the Apple Silicon backend particularly well-tuned.

llama.cpp matters because it is the only major inference engine designed first for hardware you actually own. Where vLLM and SGLang target server-class NVIDIA GPUs, llama.cpp targets laptops, desktops, Macs, Raspberry Pis, and small-VRAM consumer cards. Combined with Apple Silicon's unified-memory bandwidth, llama.cpp is the substrate that makes "a 70B model running on your Mac at usable speed" possible.

Compared to siblings: Ollama is the polished UX layer wrapping llama.cpp, MLC-LLM is a cross-platform compilation play, and MLX is Apple's first-party ML framework. llama.cpp is the engine the others either build on or compete with directly.

The project is production-ready and is the on-device standard. It is used by Ollama, LM Studio, Jan, Faraday, and many other consumer AI apps that run inference on the user's machine. It is maintained by an energetic community led by Gerganov, and its pace of development is among the fastest in open AI infrastructure.

The quiet strategic significance: llama.cpp is the layer that makes the sovereignty-anchored "buy your own hardware, run your own models" thesis operational today.
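
The "70B model on your Mac at usable speed" claim can be sanity-checked with back-of-envelope arithmetic. The sketch below is not taken from llama.cpp; the function names are hypothetical and the inputs are illustrative assumptions. It assumes a dense model, a flat 4 bits per weight (real GGUF quantizations store extra scale data, so the effective figure is somewhat higher), and 400 GB/s of memory bandwidth. It also assumes token generation is limited by reading every weight once per token, and it ignores the KV cache, compute time, and other overhead, so the result is a rough upper bound rather than a benchmark.

```python
# Rough estimate only; every number below is an assumption for illustration.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def decode_tokens_per_sec_upper_bound(model_bytes: float,
                                      bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    return bandwidth_bytes_per_sec / model_bytes


if __name__ == "__main__":
    n_params = 70e9          # 70B parameters (from the text above)
    bits_per_weight = 4      # assumed 4-bit quantization, ignoring scale overhead
    bandwidth = 400e9        # assumed 400 GB/s unified-memory bandwidth

    size = weight_bytes(n_params, bits_per_weight)
    bound = decode_tokens_per_sec_upper_bound(size, bandwidth)

    print(f"weights: ~{size / 1e9:.0f} GB")
    print(f"decode upper bound: ~{bound:.1f} tokens/s")
```

Under these assumptions the weights fit in the memory of a high-end consumer machine and the bound lands in the range of several tokens per second. Swapping in the bandwidth of a different machine or a different quantization level shows how quickly the estimate moves.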


