llama.cpp

llama.cpp is Georgi Gerganov's open inference engine for large language models. It is written in pure C/C++ with no required runtime dependencies and is MIT-licensed. The project also defines the GGUF file format, which has become the de facto standard for distributing quantized open-weights models. Backends include CPU SIMD, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, and more, with the Apple Silicon backend particularly well-tuned.

llama.cpp matters because it is the only major inference engine designed first for hardware you actually own. Where vLLM and SGLang target server-class NVIDIA GPUs, llama.cpp targets laptops, desktops, Macs, Raspberry Pis, and small-VRAM consumer cards. Combined with Apple Silicon's unified-memory bandwidth, llama.cpp is the substrate that makes "a 70B model running on your Mac at usable speed" possible.

Compared to its siblings: Ollama is the polished UX layer wrapping llama.cpp, MLC-LLM is a cross-platform compilation play, and MLX is Apple's first-party ML framework. llama.cpp is the engine the others either build on or compete with directly.

llama.cpp is production-ready and the on-device standard. It is used by Ollama, LM Studio, Jan, Faraday, and many other consumer AI apps that run inference on the user's machine. It is maintained by an energetic community led by Gerganov, and the project's pace is among the fastest in open AI infrastructure.

The quiet strategic significance: llama.cpp is the layer that makes the sovereignty-anchored "buy your own hardware, run your own models" thesis operational today.
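To make the GGUF format mentioned above a little more concrete, here is a minimal Python sketch that reads only the fixed-size header of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key/value counts. The function name `read_gguf_header` is hypothetical and not part of llama.cpp; the metadata and tensor tables that follow the header are not parsed here.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumption: little-endian file using the v2+ header layout, where
    both counts are unsigned 64-bit integers.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        if version < 2:
            # v1 used 32-bit counts; this sketch does not handle it.
            raise ValueError(f"unsupported GGUF version {version}")
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("model.gguf")` on a downloaded model (the path is a placeholder) is a quick way to check that a file really is GGUF before handing it to an engine.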

Other projects at the Runtime layer

12 siblings · ordered open first

vLLM Open source

Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.

SGLang Open source

RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

Ollama Open source

Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.

Text Generation Inference (TGI) Open source

HuggingFace's production inference server; maintenance mode in 2026 as vLLM became the standard.

MLC-LLM Open source

Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.

Outlines Open source

Structured-output library using finite-state-machine guided decoding to force model output to match a regex, JSON Schema, or grammar.

XGrammar Open source

Fast grammar-based constrained-decoding engine, used as a structured-output backend by several open serving engines.

llguidance Open source

Low-level Rust constrained-decoding engine that enforces grammars and JSON Schema at high throughput; powers Guidance.

Guidance Open source

A programming model for constrained generation that interleaves control flow with model calls and enforces regex, grammars, and JSON Schema.

Instructor Open source

Library for getting Pydantic-typed structured output from LLMs across many providers, built on function calling and JSON mode with automatic retries.

LM Format Enforcer Open source

Token-filtering library that enforces a JSON Schema or regex during generation; integrates with Transformers, vLLM, and llama.cpp.

TensorRT-LLM Source available

NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.
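Several of the entries above (Outlines, XGrammar, llguidance, LM Format Enforcer) describe constrained decoding: at each step, tokens that would break the target pattern are masked out before one is chosen. The toy Python sketch below illustrates that general idea with a hand-written finite-state machine for the pattern `[0-9]+\.[0-9]+`. Everything in it (the states, the six-token vocabulary, the fixed scores standing in for model logits) is a made-up assumption for illustration; it is not how any of the listed libraries is implemented.

```python
DIGITS = "0123456789"
ACCEPTING = {3}  # state 3: at least one digit after the decimal point


def step(state, ch):
    """Advance the toy DFA by one character; None means 'rejected'."""
    if ch in DIGITS:
        # 0 start, 1 integer part, 2 just after '.', 3 fraction part
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None


def advance(state, token):
    """Walk a multi-character token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def allowed_tokens(state, vocab):
    """Tokens that keep the output a valid prefix of the pattern."""
    return [t for t in vocab if advance(state, t) is not None]


def constrained_greedy(scores, vocab, max_tokens):
    """Greedy decoding restricted to allowed tokens.

    `scores` is a fixed dict standing in for per-step model logits.
    """
    state, out = 0, []
    for _ in range(max_tokens):
        candidates = allowed_tokens(state, vocab)
        if not candidates:
            break
        best = max(candidates, key=lambda t: scores[t])
        out.append(best)
        state = advance(state, best)
    return "".join(out), state in ACCEPTING


vocab = ["3", "14", ".", "pi", " is", "3.1"]
scores = {"pi": 9.0, " is": 8.0, ".": 5.0, "3": 4.0, "14": 3.0, "3.1": 1.0}
print(constrained_greedy(scores, vocab, max_tokens=3))  # ('3.3', True)
```

Without the mask, the highest-scoring token `pi` would be chosen at every step. With it, the output is forced to match the pattern. A real engine would also need an end-of-sequence token that is only allowed in accepting states, and would avoid re-walking the whole vocabulary at every step.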

