Glossary

ONNX

An open interchange format for machine learning models, designed to let a model trained in one framework run in another via a portable graph representation.

Runtime also: Governance aka Open Neural Network Exchange

An open format for representing ML models as a computation graph plus weights, originally launched by Microsoft and Facebook in 2017. The goal was framework portability: train in PyTorch, deploy in a different runtime, run on hardware whose vendor optimized for ONNX rather than for every framework natively.

In practice ONNX has had mixed adoption for LLMs. The format predates modern transformerruntimeThe neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM. Open full entry architectures and the operator coverage required extension for attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry , KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix. Open full entry , and quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry . ONNX Runtime (Microsoft) is the reference runtime and works for most production deployment patterns, particularly on Windows and edge devices. For on-Mac or on-Linux LLM serving, the GGUFweightsA binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023. Open full entry + llama.cppruntimeGeorgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products. Open full entry path is more common.

Sources

Mentioned in

ZKML

Back to glossary