Glossary
on-device
Running model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, 4-bit quantization, and runtimes like llama.cpp and MLX.
The concrete implementation pattern behind local-first AI.
Inference runs on the user’s CPU, GPU, NPU, or unified-memory chip rather than on a remote API. Three reference platforms exist in 2026: Apple Silicon (M-series and A-series), running via MLX or llama.cpp; modern Android with NPU acceleration; and PC builds with consumer GPUs running Ollama or LM Studio.
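As a rough illustration of "local rather than remote", the sketch below builds a generation request aimed at a server on the same machine. It assumes an Ollama server on its default port (11434) and its `/api/generate` endpoint; the model tag `llama3.2:1b` and the helper names `build_local_request` and `generate` are placeholders chosen for this example, not part of any runtime.

```python
import json
import urllib.request

# Assumption: an Ollama server is running locally on its default port.
LOCAL_ENDPOINT = "http://localhost:11434/api/generate"


def build_local_request(prompt: str, model: str = "llama3.2:1b") -> urllib.request.Request:
    """Build (but do not send) a generation request aimed at the local machine."""
    # "stream": False asks for one JSON object instead of a token stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2:1b") -> str:
    """Send the request. Needs the local server running with the model pulled."""
    with urllib.request.urlopen(build_local_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this note in one line.")` would return the model's text when the server is up. The prompt goes only to `localhost`, which is the practical meaning of on-device here.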
The model-size envelope has expanded faster than the hardware has. A high-end Mac Studio can run Llama-3-70B at Q4 quantization at roughly 8 to 14 tokens per second; a base M-series MacBook can run 8B models at interactive speeds; a Pixel phone can run 1B to 3B models for on-device assistants. Those small models now match or exceed GPT-3.5-level quality.
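The size classes above follow from simple arithmetic on weight storage. This is a minimal sketch that assumes memory is dominated by the weights and ignores activations, the KV cache, and runtime overhead. The helper name `weight_memory_gb` is illustrative. Real 4-bit formats also store per-block scale metadata, so their effective size is somewhat above 4 bits per weight.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts taken from the size classes named in the text.
for name, params in [("70B", 70e9), ("8B", 8e9), ("3B", 3e9), ("1B", 1e9)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

Under these assumptions a 70B model drops from about 140 GB at 16 bits to about 35 GB at 4 bits. That is the difference between not fitting on any consumer machine and fitting in a high-memory unified-memory desktop.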
Three bottlenecks remain. Long context is expensive in local memory. Multi-turn agent loops with tools (the on-device equivalent of Claude Code) are an open research area. And frontier-class intelligence (the largest reasoning models) still does not fit on consumer hardware.
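The long-context cost can be estimated from the key-value cache, which grows linearly with the number of tokens held in context. This sketch assumes a standard transformer with grouped-query attention and 16-bit cache entries. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are assumed to match the published Llama-3-70B configuration and should be checked against the model's own config file. The context lengths are chosen for illustration, and `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> float:
    """Approximate KV-cache size in GiB for a single sequence."""
    # Factor 2: each layer stores one key and one value vector per token.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 2**30


# Assumed Llama-3-70B-style shape; verify against the actual model config.
for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=tokens)
    print(f"{tokens:>7} tokens: ~{size:.1f} GiB of KV cache")
```

With these assumed values the cache costs 320 KiB per token: about 2.5 GiB at 8K tokens and about 40 GiB at 128K tokens, on top of the weights. This is why long context strains local memory well before compute becomes the limit.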