Ollama is a local model runner that wraps llama.cpp with a Docker-style command-line and HTTP API. Install Ollama, run `ollama pull llama3.3` and `ollama run llama3.3`, and you have a local model serving over a localhost API endpoint. MIT- licensed. Ships native installers for macOS, Linux, and Windows. Ollama matters because it makes local inference accessible to developers who do not want to learn the llama.cpp build flags and quantization settings. The model library handles GGUF downloads from a curated registry; the API speaks an OpenAI-compatible shape (with extensions) so existing client code works against Ollama with a baseURL change. Compared to siblings: llama.cpp is the engine underneath (more flexibility, more setup), LM Studio is a GUI alternative, MLX is Apple's direct API for Apple Silicon. Ollama is the most-deployed local runner among developers who want minimal friction. Production-ready for development and personal use. Used as the backend for many local-AI applications, IDE plugins, and personal-AI projects (HRF-funded Orchard pairs Ollama with Lightning and Cashu). The strategic position: Ollama is the "easy button" for local AI; for production-scale serving you generally move to vLLM or SGLang. The growing question is whether Ollama's commercialization plans (paid tiers, hosted services) preserve the open-source character that made it useful in the first place.
The Stack · Runtime · Open source
Ollama
Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.
Sources
- Ollama https://ollama.com/
- Ollama on GitHub https://github.com/ollama/ollama
- Ollama API Documentation https://github.com/ollama/ollama/blob/main/docs/api.md
Want a follow-up? Ask the chat about Ollama in context. It will compare to siblings at the same layer and ground every claim in the wiki.
Other projects at the Runtime layer
6 siblings · ordered open first
- vLLM Open source
Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.
- SGLang Open source
RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.
- llama.cpp Open source
Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.
- Text Generation Inference (TGI) Open source
HuggingFace's production inference server; maintenance mode in 2026 as vLLM became the standard.
- MLC-LLM Open source
Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.
- TensorRT-LLM Source available
NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.