The Open-Source AI Stack

Glossary

local-first

An architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation.

A design principle. Inference happens on the user’s hardware (laptop, phone, edge device). The user keeps custody of prompts, outputs, and any retrieved data. Sync to the cloud, if any, is opt-in and end-to-end encrypted. The reference point is the 2019 Ink & Switch essay on local-first software, predating modern LLMs but mapping cleanly onto them.
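To make "inference happens on the user's hardware" concrete, here is a minimal sketch of a client that only ever talks to a loopback address. It assumes an Ollama server running locally on its default port with a model tagged `llama3` already pulled; the model tag and the helper names `build_request` and `generate` are illustrative, not part of any entry above.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default loopback address.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local runtime."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Because the only URL in the program is a loopback address, prompts and outputs never leave the machine; any cloud sync would have to be a separate, explicitly enabled code path.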

For LLMs the practical limit is model size. Llama-3-8B and Mistral-7B run well on consumer hardware via llama.cpp and Ollama; Llama-3-70B fits on a high-end Mac or pair of consumer GPUs; frontier-scale models (400B+ parameters) do not. The trend through 2024 to 2026 of smaller-but-capable models (Phi, Qwen-7B, Gemma-3) has expanded what local-first can credibly serve.
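The size limit can be approximated with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is an estimate only. The 4-bit quantization, the 1.2x overhead factor for KV cache and runtime buffers, and the memory budgets are all assumptions chosen for illustration, not measurements.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.2) -> bool:
    """True if weights plus an assumed overhead factor fit in memory_gb."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    # Hypothetical budgets: a 16 GB laptop, two 24 GB GPUs, a 128 GB workstation.
    for name, size_b, budget_gb in [("8B", 8, 16), ("70B", 70, 48), ("400B", 400, 128)]:
        gb = weight_memory_gb(size_b, 4)
        print(f"{name}: ~{gb:.0f} GB of weights at 4-bit, fits in {budget_gb} GB: "
              f"{fits(size_b, 4, budget_gb)}")
```

Under these assumptions an 8B model needs about 4 GB of weights, a 70B model about 35 GB, and a 400B model about 200 GB, which matches the three tiers described above.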

The sovereignty case is direct. Local-first inference means no third party sees the user’s queries. The Apple Intelligence design (mostly on-device, with confidential-compute fallback) and the Maple AI posture (local-first plus TEE fallback) are the two most prominent 2026 productizations of the pattern.
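The "local-first plus fallback" pattern reduces to a small routing decision. The sketch below is a hypothetical illustration of that decision, not a description of how Apple Intelligence or Maple AI are implemented; the names `Route`, `Policy`, and `choose_route` are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"    # run on the user's own device
    TEE = "tee"        # hardware-isolated remote enclave
    REFUSE = "refuse"  # do not send the query anywhere


@dataclass(frozen=True)
class Policy:
    # Assumption: remote fallback is opt-in, so it is off by default.
    allow_tee_fallback: bool = False


def choose_route(model_fits_locally: bool, policy: Policy) -> Route:
    """Prefer the device; leave it only if the user has opted in."""
    if model_fits_locally:
        return Route.LOCAL
    if policy.allow_tee_fallback:
        return Route.TEE
    return Route.REFUSE
```

The ordering is the design choice: the device is always tried first, and the query leaves it only when the model cannot run locally and the user has explicitly allowed the fallback.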
