Ollama

Ollama is a local model runner that wraps llama.cpp with a Docker-style command-line and HTTP API. Install Ollama, run `ollama pull llama3.3` and `ollama run llama3.3`, and you have a local model serving over a localhost API endpoint. MIT- licensed. Ships native installers for macOS, Linux, and Windows. Ollama matters because it makes local inference accessible to developers who do not want to learn the llama.cpp build flags and quantization settings. The model library handles GGUF downloads from a curated registry; the API speaks an OpenAI-compatible shape (with extensions) so existing client code works against Ollama with a baseURL change. Compared to siblings: llama.cpp is the engine underneath (more flexibility, more setup), LM Studio is a GUI alternative, MLX is Apple's direct API for Apple Silicon. Ollama is the most-deployed local runner among developers who want minimal friction. Production-ready for development and personal use. Used as the backend for many local-AI applications, IDE plugins, and personal-AI projects (HRF-funded Orchard pairs Ollama with Lightning and Cashu). The strategic position: Ollama is the "easy button" for local AI; for production-scale serving you generally move to vLLM or SGLang. The growing question is whether Ollama's commercialization plans (paid tiers, hosted services) preserve the open-source character that made it useful in the first place.

Other projects at the Runtime layer

12 siblings · ordered open first

vLLM Open source

Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.

SGLang Open source

RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

llama.cpp Open source

Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.

Text Generation Inference (TGI) Open source

HuggingFace's production inference server; maintenance mode in 2026 as vLLM became the standard.

MLC-LLM Open source

Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.

Outlines Open source

Structured-output library using finite-state-machine guided decoding to force model output to match a regex, JSON Schema, or grammar.

XGrammar Open source

Fast grammar-based constrained-decoding engine, used as a structured-output backend by several open serving engines.

llguidance Open source

Low-level Rust constrained-decoding engine that enforces grammars and JSON Schema at high throughput; powers Guidance.

Guidance Open source

A programming model for constrained generation that interleaves control flow with model calls and enforces regex, grammars, and JSON Schema.

Instructor Open source

Library for getting Pydantic-typed structured output from LLMs across many providers, built on function calling and JSON mode with automatic retries.

LM Format Enforcer Open source

Token-filtering library that enforces a JSON Schema or regex during generation; integrates with Transformers, vLLM, and llama.cpp.

TensorRT-LLM Source available

NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.

Sources

Other projects at the Runtime layer