Glossary
What the terms mean
Definitions for the concepts, protocols, and projects this site uses across its 15 layers. Each entry has a short hover summary and a deeper page. Most show up inline in the prose as dotted underlines you can hover or tap.
174 entries · 15 layers covered
Infrastructure
16 terms
- AI factory
A purpose-built data center optimized for AI training rather than general cloud workloads, characterized by liquid-cooled high-density GPU racks, gigawatt-scale single-tenant power, and tightly-coupled networking.
- behind-the-meter
A power arrangement where generation sits on the same side of the utility meter as the load, letting a data center draw directly from the plant and bypass the grid.
- decentralized GPU marketplace
A protocol market matching GPU supply from many independent providers to AI demand, settled on a token rail; Akash, io.net, Bittensor compute, and Hyperbolic are canonical.
- decode also in Runtime
The second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute.
- direct-to-chip cooling
A cooling architecture that pipes liquid coolant directly to a cold plate on each processor, evacuating the 700+ watts per GPU that air cooling cannot handle.
- expert parallelism also in Runtime
A parallelism strategy for mixture-of-experts models where different GPUs hold different experts; requires all-to-all communication on every token routing step.
- gigawatt-class cluster
An AI training facility whose power draw is measured in gigawatts rather than megawatts, the scale at which siting decisions become grid-and-permitting problems rather than real-estate ones.
- grid interconnect queue
The regulatory queue a new generation or load project must traverse to connect to the grid; currently the binding constraint on how fast gigawatt-class AI sites come online.
- hyperscaler capex
The capital expenditure that Microsoft, Google, Amazon, Meta, and Oracle are spending on AI infrastructure, totaling roughly $300B+ annually by 2025-2026 and dominating the supply-and-demand signal for the entire stack.
- neocloud
A specialized cloud provider focused exclusively on GPU and AI workloads, operating outside the traditional AWS/Azure/GCP hyperscaler perimeter, with CoreWeave, Lambda, and Voltage Park as the canonical examples.
- nuclear PPA
A long-term contract under which an AI operator commits to buy a defined nuclear-generated power output, becoming the cornerstone financing mechanism for gigawatt-class AI buildouts.
- prefill also in Runtime
The first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens.
- PUE
Power Usage Effectiveness, the ratio of total data center facility power to power delivered to IT equipment; lower is better, with 1.0 the floor and 1.1 a strong target.
- SMR
A nuclear reactor under ~300 MW per unit, factory-fabricated rather than site-built, positioned as the firm-power option for AI data centers needing new generation faster than conventional plants deliver.
- sovereign compute
AI compute capacity owned, operated, or contractually controlled by a nation-state for the use of its own institutions and citizens, distinct from rented capacity on US-hyperscaler clouds.
- tensor parallelism also in Runtime
A way to split a single model across multiple GPUs by sharding each layer's weight matrices and doing an all-reduce after every layer. Bandwidth-hungry but layer-by-layer fine-grained.
Silicon
41 terms
- arithmetic intensity also in Runtime
FLOPs performed per byte read from memory. Low intensity means an operation is memory-bound; high intensity means compute-bound. LLM decode has very low intensity.
- attestation also in Identity and Trust
A cryptographic protocol that lets a remote party verify which code is running inside a TEE, including which model is loaded and which build of the inference engine.
- BF16
A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.
- Cerebras
An AI compute company built around wafer-scale chips (the WSE-3 is a single die covering most of a 300mm wafer), offering some of the lowest inference latency on the market.
- confidential computing also in Identity and Trust
The umbrella category of compute architectures where workloads run isolated from the host operator, combining hardware TEEs, attestation, and encrypted-memory protections.
- CUDA
NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.
- decode also in Runtime
The second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute.
- direct-to-chip cooling also in Infrastructure
A cooling architecture that pipes liquid coolant directly to a cold plate on each processor, evacuating the 700+ watts per GPU that air cooling cannot handle.
- FlashAttention also in Runtime
An exact attention algorithm that reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving 2 to 4 times speedup with no quality loss.
- FP16
A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.
- FP4
A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.
- FP8
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
- GDDR7
The graphics memory generation on 2025-era consumer and workstation GPUs such as the RTX 5090 and RTX PRO 6000. High bandwidth per board, lower capacity than HBM.
- gigawatt-class cluster also in Infrastructure
An AI training facility whose power draw is measured in gigawatts rather than megawatts, the scale at which siting decisions become grid-and-permitting problems rather than real-estate ones.
- GPU
A massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks.
- Groq
An AI inference company with custom deterministic LPU chips and a hosted inference service that achieves extremely low time-per-token (1000+ tokens/sec on 70B models).
- HBM
Stacked DRAM used as the main memory of every modern AI accelerator, with bandwidth in TB/s rather than GB/s and capacity per stack in tens of GB.
- inference also in Runtime
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
- InfiniBand also in Compute
A high-throughput, low-latency network fabric (Mellanox, now NVIDIA) used for inter-node communication in AI training clusters, supporting RDMA for direct GPU-to-GPU transfer across machines.
- KV cache also in Runtime
The stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix.
- LPDDR5X
Low-power DRAM used as unified memory in Apple Silicon, DGX Spark, and Strix Halo. High capacity and efficiency, with bandwidth below HBM and GDDR.
- memory bandwidth
The rate (GB/s or TB/s) at which an accelerator reads its memory. It sets the ceiling on decode tokens/sec, since each token streams the active weights once.
- MLX also in Runtime
Apple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone.
- model bandwidth utilization also in Runtime
MBU is the fraction of an accelerator's peak memory bandwidth a serving stack actually reaches during decode. Real systems land around 60 to 85 percent.
- model FLOPs utilization also in Runtime
MFU is the fraction of an accelerator's peak compute a workload actually achieves. The compute-bound analogue of MBU, relevant to prefill and training, not memory-bound decode.
- NVLink also in Compute
NVIDIA's proprietary GPU-to-GPU interconnect, providing bandwidth an order of magnitude above PCIe and the basis for tightly-coupled 8-GPU server nodes (DGX, HGX).
- on-device also in Sovereignty and Decentralization Primitives
Running model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, FP8 quantization, and runtimes like llama.cpp and MLX.
- prefill also in Runtime
The first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens.
- quantization also in Weights
Storing or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss.
- RDMA also in Compute
A networking technique that lets a remote machine read or write local memory without involving the CPU, foundational for high-throughput distributed training over InfiniBand or RoCE.
- RISC-V
An open instruction set architecture, royalty-free and modular, increasingly used in AI accelerator cores (Tenstorrent, SiFive Intelligence) as the open alternative to ARM and x86.
- ROCm
AMD's open-source GPU compute stack, the main credible alternative to CUDA, with growing coverage in PyTorch and vLLM but still trailing on kernel maturity and tooling.
- roofline also in Runtime
A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.
- SGX also in Identity and Trust
Intel's earliest mainstream trusted execution environment, the predecessor to TDX, with smaller enclave sizes and a history of side-channel vulnerabilities that limited its deployment for AI.
- TEE also in Identity and Trust
A hardware-isolated CPU region where code and data are protected from inspection by the host OS, used to run inference in a way the operator cannot read or modify.
- TensorRT-LLM also in Runtime
NVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA.
- Tenstorrent
An AI accelerator startup designing RISC-V-based chips (Wormhole, Blackhole, Grendel) with an open software stack, positioned as the leading open alternative to NVIDIA at the silicon layer.
- tokens per second also in Runtime
The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.
- TPU
Google's custom AI accelerator family, used internally for training Gemini and externally via Google Cloud, designed around dense matrix multiplication with a systolic array architecture.
- unified memory
A single physical memory pool shared by CPU and GPU, so the full capacity is usable as model memory; used by Apple Silicon, Strix Halo, and DGX Spark.
- VRAM math also in Weights
The first-pass formula for whether a model fits on a GPU. VRAM ≈ parameters × (bits ÷ 8), plus 10-30 percent for KV cache, activations, and overhead.
Compute
23 terms
- AI factory also in Infrastructure
A purpose-built data center optimized for AI training rather than general cloud workloads, characterized by liquid-cooled high-density GPU racks, gigawatt-scale single-tenant power, and tightly-coupled networking.
- batching
Grouping multiple requests or training examples into a single forward or backward pass, the lever that turns GPU compute density into throughput.
- decentralized training also in Sovereignty and Decentralization Primitives
Training a model across many independently-operated nodes that are not tightly coupled, contrasted with single-cluster training; the architecture for community-owned model production.
- DeepSpeed also in Training
Microsoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale.
- GDDR7 also in Silicon
The graphics memory generation on 2025-era consumer and workstation GPUs such as the RTX 5090 and RTX PRO 6000. High bandwidth per board, lower capacity than HBM.
- gigawatt-class cluster also in Infrastructure
An AI training facility whose power draw is measured in gigawatts rather than megawatts, the scale at which siting decisions become grid-and-permitting problems rather than real-estate ones.
- GPU also in Silicon
A massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks.
- HBM also in Silicon
Stacked DRAM used as the main memory of every modern AI accelerator, with bandwidth in TB/s rather than GB/s and capacity per stack in tens of GB.
- InfiniBand
A high-throughput, low-latency network fabric (Mellanox, now NVIDIA) used for inter-node communication in AI training clusters, supporting RDMA for direct GPU-to-GPU transfer across machines.
- latency
The time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric.
- LPDDR5X also in Silicon
Low-power DRAM used as unified memory in Apple Silicon, DGX Spark, and Strix Halo. High capacity and efficiency, with bandwidth below HBM and GDDR.
- Megatron also in Training
NVIDIA's distributed-training framework for large transformer models, providing the reference implementation of tensor parallelism, pipeline parallelism, and 3D parallelism used in many open and closed training runs.
- memory bandwidth also in Silicon
The rate (GB/s or TB/s) at which an accelerator reads its memory. It sets the ceiling on decode tokens/sec, since each token streams the active weights once.
- neocloud also in Infrastructure
A specialized cloud provider focused exclusively on GPU and AI workloads, operating outside the traditional AWS/Azure/GCP hyperscaler perimeter, with CoreWeave, Lambda, and Voltage Park as the canonical examples.
- NVLink
NVIDIA's proprietary GPU-to-GPU interconnect, providing bandwidth an order of magnitude above PCIe and the basis for tightly-coupled 8-GPU server nodes (DGX, HGX).
- RDMA
A networking technique that lets a remote machine read or write local memory without involving the CPU, foundational for high-throughput distributed training over InfiniBand or RoCE.
- roofline also in Runtime
A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.
- scheduler
The component in a serving or training system that decides which work runs next, balancing throughput, fairness, latency targets, and resource constraints.
- sharding also in Training
A distributed training pattern where parameters, gradients, and optimizer states are split across GPUs (and sometimes hosts) so the total memory footprint scales with the cluster, not with each GPU.
- spot instance
A discounted cloud instance that the provider can reclaim with little warning, used for fault-tolerant training and batch inference where interruption is cheaper than reservation cost.
- throughput
The rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics.
- TPU also in Silicon
Google's custom AI accelerator family, used internally for training Gemini and externally via Google Cloud, designed around dense matrix multiplication with a systolic array architecture.
- unified memory also in Silicon
A single physical memory pool shared by CPU and GPU, so the full capacity is usable as model memory; used by Apple Silicon, Strix Halo, and DGX Spark.
Data
10 terms
- BPE
A subword tokenization algorithm that iteratively merges the most-frequent byte pairs in a corpus, producing a vocabulary that balances common-word coverage with arbitrary-text fallback.
- Common Crawl
A nonprofit-run repeated crawl of the public web maintained since 2007, the upstream raw source for nearly every open web-scale pretraining corpus.
- FineWeb
An open large-scale web text dataset from Hugging Face, the highest-quality permissively-licensed pretraining corpus by 2024 to 2026 with ~15 trillion tokens after deduplication and filtering.
- Hugging Face also in Training
The model hub, dataset hub, and open-source library suite (Transformers, Datasets, Tokenizers, Accelerate, PEFT, TRL) that anchors the open-AI ecosystem's distribution and tooling layer.
- OSAID also in Governance
The OSI's October 2024 definition of "open source AI," requiring not just weights but enough information about data, code, and architecture for third parties to reproduce the system.
- pretraining also in Training
The first and most compute-expensive training phase, where a base model learns general capabilities by predicting the next token on trillions of words of web and book data.
- RedPajama
An early open reproduction of the Llama 1 pretraining corpus from Together AI (2023), now superseded by FineWeb and Dolma but historically important as the first open frontier-scale dataset.
- The Pile
An 825 GB diverse-source pretraining dataset assembled by EleutherAI in 2020, the open-corpus precedent that the later RedPajama and FineWeb projects expanded on.
- tokenization
The process of mapping raw text into the integer-ID sequences a model consumes, governed by the model's specific tokenizer; the rate-limiting interface between text and tensor.
- tokenizer
The component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding.
Training
48 terms
- ALiBi also in Runtime
A positional encoding that adds a linear bias to attention scores based on the distance between tokens, with no learned position parameters and natural length extrapolation.
- alignment
The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.
- attention also in Runtime
The transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors.
- Axolotl
An open YAML-driven fine-tuning framework that orchestrates Hugging Face Transformers, PEFT, TRL, and DeepSpeed for one-shot LoRA, QLoRA, and full fine-tuning workflows.
- benchmark also in Evaluation
A standardized dataset and scoring rubric used to compare model capability on a defined task, the unit of model evaluation since GLUE made the format the default.
- BF16 also in Silicon
A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.
- BPE also in Data
A subword tokenization algorithm that iteratively merges the most-frequent byte pairs in a corpus, producing a vocabulary that balances common-word coverage with arbitrary-text fallback.
- constitutional AI also in Safety and Guardrails
Anthropic's alignment technique where a model is trained to critique and revise its own outputs against a written list of principles (the "constitution"), reducing the need for human ranking labels.
- context window also in Runtime
The maximum number of tokens a model can attend to in a single forward pass, set during pretraining and extended (sometimes) via fine-tuning or training-free extrapolation tricks.
- CUDA also in Silicon
NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.
- decentralized training also in Sovereignty and Decentralization Primitives
Training a model across many independently-operated nodes that are not tightly coupled, contrasted with single-cluster training; the architecture for community-owned model production.
- DeepSeek also in Weights
A Chinese open-weight family known for the V3 MoE base model and the R1 reasoning model, both released under permissive licenses and unusually transparent in their training-cost reporting.
- DeepSpeed
Microsoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale.
- dense also in Weights
A transformer where every parameter activates on every token; the conventional architecture before mixture of experts became common at frontier scale.
- DPO
A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.
- embedding also in Retrieval and Memory
A fixed-size vector representation of a piece of text learned so semantically similar texts land near each other in the vector space, the basis for vector search and most RAG.
- expert parallelism also in Runtime
A parallelism strategy for mixture-of-experts models where different GPUs hold different experts; requires all-to-all communication on every token routing step.
- fine-tuning
Continued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch.
- FineWeb also in Data
An open large-scale web text dataset from Hugging Face, the highest-quality permissively-licensed pretraining corpus by 2024 to 2026 with ~15 trillion tokens after deduplication and filtering.
- FP16 also in Silicon
A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.
- FP8 also in Silicon
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
- GQA also in Runtime
An attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention.
- Hugging Face
The model hub, dataset hub, and open-source library suite (Transformers, Datasets, Tokenizers, Accelerate, PEFT, TRL) that anchors the open-AI ecosystem's distribution and tooling layer.
- knowledge distillation
A training technique where a small student model learns to mimic a larger teacher model's output distributions, transferring capability into a cheaper-to-serve form.
- LoRA
A parameter-efficient fine-tuning method that injects small low-rank adapter matrices into a frozen base model, training a tiny fraction of weights instead of the full model.
- Megatron
NVIDIA's distributed-training framework for large transformer models, providing the reference implementation of tensor parallelism, pipeline parallelism, and 3D parallelism used in many open and closed training runs.
- mixture of experts also in Weights
A model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute.
- model FLOPs utilization also in Runtime
MFU is the fraction of an accelerator's peak compute a workload actually achieves. The compute-bound analogue of MBU, relevant to prefill and training, not memory-bound decode.
- Multi-LoRA inference also in Runtime
Serving many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies.
- NF4 also in Weights
A 4-bit normal-float quantization format from the QLoRA paper. The 16 quantization levels are spaced to match the empirical distribution of pretrained weights.
- PEFT
A family of fine-tuning methods that update only a small fraction of a base model's parameters, making fine-tuning feasible on consumer hardware and storage-efficient at deployment.
- perplexity also in Evaluation
A measure of how well a language model predicts a text, equal to the exponential of the per-token cross-entropy loss; lower is better, often used for training diagnostics.
- post-training
Everything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant.
- pretraining
The first and most compute-expensive training phase, where a base model learns general capabilities by predicting the next token on trillions of words of web and book data.
- QLoRA
A fine-tuning method that combines 4-bit quantization of the frozen base model with LoRA adapters, making large-model fine-tuning fit on a single consumer GPU.
- RedPajama also in Data
An early open reproduction of the Llama 1 pretraining corpus from Together AI (2023), now superseded by FineWeb and Dolma but historically important as the first open frontier-scale dataset.
- RLHF
A post-training pipeline that uses human preference rankings to train a reward model, then optimizes a base model against that reward via reinforcement learning.
- RoPE also in Runtime
A positional encoding that rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position, making attention scores depend on relative not absolute position.
- scheduler also in Compute
The component in a serving or training system that decides which work runs next, balancing throughput, fairness, latency targets, and resource constraints.
- sharding
A distributed training pattern where parameters, gradients, and optimizer states are split across GPUs (and sometimes hosts) so the total memory footprint scales with the cluster, not with each GPU.
- tensor parallelism also in Runtime
A way to split a single model across multiple GPUs by sharding each layer's weight matrices and doing an all-reduce after every layer. Bandwidth-hungry but layer-by-layer fine-grained.
- The Pile also in Data
An 825 GB diverse-source pretraining dataset assembled by EleutherAI in 2020, the open-corpus precedent that the later RedPajama and FineWeb projects expanded on.
- tokenization also in Data
The process of mapping raw text into the integer-ID sequences a model consumes, governed by the model's specific tokenizer; the rate-limiting interface between text and tensor.
- tokenizer also in Data
The component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding.
- transformer also in Runtime
The neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM.
- TRL
Hugging Face's library for preference and reinforcement learning on transformer models, the canonical open implementation of RLHF, DPO, KTO, ORPO, and related preference-tuning methods.
- Unsloth
An open fine-tuning library that uses hand-written Triton kernels and a manual gradient implementation to run LoRA and QLoRA fine-tuning roughly 2x faster than the Hugging Face baseline.
- YaRN also in Weights
A position-encoding extension technique that lets a RoPE-pretrained model handle context windows longer than its training length without quality collapse.
Weights
50 terms
- acceptable-use also in Governance
License or terms-of-service clauses that prohibit certain uses (weapons, surveillance, harassment, child sexual abuse material), common on open-weight licenses but rejected by the strict open-source definition.
- ALiBi also in Runtime
A positional encoding that adds a linear bias to attention scores based on the distance between tokens, with no learned position parameters and natural length extrapolation.
- Apache 2.0 also in Governance
A permissive open-source license used by most open-weight model releases (Llama from 4 onward partial, Qwen, Mistral, DeepSeek, Falcon), allowing commercial use without acceptable-use restrictions.
- attention also in Runtime
The transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors.
- AWQ
A post-training quantization method that protects the small fraction of weight channels that handle the largest activations, achieving 4-bit weights with little quality loss.
- benchmark also in Evaluation
A standardized dataset and scoring rubric used to compare model capability on a defined task, the unit of model evaluation since GLUE made the format the default.
- BF16 also in Silicon
A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.
- context window also in Runtime
The maximum number of tokens a model can attend to in a single forward pass, set during pretraining and extended (sometimes) via fine-tuning or training-free extrapolation tricks.
- DeepSeek
A Chinese open-weight family known for the V3 MoE base model and the R1 reasoning model, both released under permissive licenses and unusually transparent in their training-cost reporting.
- dense
A transformer where every parameter activates on every token; the conventional architecture before mixture of experts became common at frontier scale.
- field-of-use also in Governance
License clauses that limit which industries or applications a model may be deployed in, restricting use to non-competitive, non-commercial, or non-government purposes.
- fine-tuning also in Training
Continued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch.
- FP16 also in Silicon
A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.
- FP4 also in Silicon
A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.
- FP8 also in Silicon
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
- frontier
The current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold.
- Gemma
Google's open-weight model family derived from Gemini research, with source-available licensing that includes an acceptable-use clause and license-revocation hook.
- GGUF
A binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023.
- GPTQ
A post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data.
- GQA also in Runtime
An attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention.
- Hugging Face also in Training
The model hub, dataset hub, and open-source library suite (Transformers, Datasets, Tokenizers, Accelerate, PEFT, TRL) that anchors the open-AI ecosystem's distribution and tooling layer.
- hybrid attention also in Runtime
An attention design that interleaves different mechanisms across layers, typically global plus sliding-window, to combine quality with long-context efficiency.
- knowledge distillation also in Training
A training technique where a small student model learns to mimic a larger teacher model's output distributions, transferring capability into a cheaper-to-serve form.
- Llama
Meta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause.
- Llama Guard also in Safety and Guardrails
Meta's open content-moderation model line, designed to classify prompts and responses against a configurable taxonomy of harms, deployable as an input/output filter.
- LoRA also in Training
A parameter-efficient fine-tuning method that injects small low-rank adapter matrices into a frozen base model, training a tiny fraction of weights instead of the full model.
- MAU also in Governance
A user-count metric used in restrictive open-weights licenses (notably Llama's Community License) to trigger a requirement to negotiate a separate commercial license at scale.
- MHA also in Runtime
Standard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow.
- Mistral
A French open-weight model family from Mistral AI, released mostly under Apache 2.0 with strong performance per parameter and notable MoE variants (Mixtral, Mixtral 8x22B).
- Mixtral
Mistral AI's MoE model line, with Mixtral 8x7B (the first widely-adopted open mixture-of-experts model) and the larger Mixtral 8x22B as its two flagship releases.
- mixture of experts
A model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute.
- MLA also in Runtime
An attention variant introduced in DeepSeek-V2 that compresses keys and values through a learned low-rank projection, dramatically shrinking the KV cache.
- MQA also in Runtime
An attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention.
- Multi-LoRA inference also in Runtime
Serving many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies.
- NF4
A 4-bit normal-float quantization format from the QLoRA paper. The 16 quantization levels are spaced to match the empirical distribution of pretrained weights.
- open weights
A model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness.
- OSAID also in Governance
The OSI's October 2024 definition of "open source AI," requiring not just weights but enough information about data, code, and architecture for third parties to reproduce the system.
- PEFT also in Training
A family of fine-tuning methods that update only a small fraction of a base model's parameters, making fine-tuning feasible on consumer hardware and storage-efficient at deployment.
- post-training also in Training
Everything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant.
- pretraining also in Training
The first and most compute-expensive training phase, where a base model learns general capabilities by predicting the next token on trillions of words of web and book data.
- QLoRA also in Training
A fine-tuning method that combines 4-bit quantization of the frozen base model with LoRA adapters, making large-model fine-tuning fit on a single consumer GPU.
- quantization
Storing or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss.
- Qwen
Alibaba's open-weight model family, leading the multilingual and Chinese-language open-weight space, released under Apache 2.0 with sizes from 0.6B to 235B parameters.
- RoPE also in Runtime
A positional encoding that rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position, making attention scores depend on relative not absolute position.
- sliding window attention also in Runtime
An attention pattern where each token attends only to a fixed window of recent tokens, trading global lookup for linear-cost inference at long sequence lengths.
- source-available
A license category that lets users read and modify the code or weights but imposes restrictions (use limits, non-compete, MAU thresholds) that exclude it from the strict open-source definition.
- state space model
An alternative to attention that processes sequences via a learned linear recurrence; scales linearly with sequence length where attention scales quadratically.
- transformer also in Runtime
The neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM.
- VRAM math
The first-pass formula for whether a model fits on a GPU. VRAM ≈ parameters × (bits ÷ 8), plus 10-30 percent for KV cache, activations, and overhead.
- YaRN
A position-encoding extension technique that lets a RoPE-pretrained model handle context windows longer than its training length without quality collapse.
Runtime
73 terms
- ALiBi
A positional encoding that adds a linear bias to attention scores based on the distance between tokens, with no learned position parameters and natural length extrapolation.
- arithmetic intensity
FLOPs performed per byte read from memory. Low intensity means an operation is memory-bound; high intensity means compute-bound. LLM decode has very low intensity.
- attention
The transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors.
- AWQ also in Weights
A post-training quantization method that protects the small fraction of weight channels that handle the largest activations, achieving 4-bit weights with little quality loss.
- batching also in Compute
Grouping multiple requests or training examples into a single forward or backward pass, the lever that turns GPU compute density into throughput.
- BF16 also in Silicon
A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.
- confidential computing also in Identity and Trust
The umbrella category of compute architectures where workloads run isolated from the host operator, combining hardware TEEs, attestation, and encrypted-memory protections.
- context window
The maximum number of tokens a model can attend to in a single forward pass, set during pretraining and extended (sometimes) via fine-tuning or training-free extrapolation tricks.
- continuous batching
A request-scheduling pattern where the inference engine adds new requests to the running batch as soon as one finishes a token, instead of waiting for the whole batch to complete.
- CUDA also in Silicon
NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.
- decode
The second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute.
- dense also in Weights
A transformer where every parameter activates on every token; the conventional architecture before mixture of experts became common at frontier scale.
- expert parallelism
A parallelism strategy for mixture-of-experts models where different GPUs hold different experts; requires all-to-all communication on every token routing step.
- FlashAttention
An exact attention algorithm that reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving 2 to 4 times speedup with no quality loss.
- FP16 also in Silicon
A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.
- FP4 also in Silicon
A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.
- FP8 also in Silicon
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
- GGUF also in Weights
A binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023.
- GPTQ also in Weights
A post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data.
- GPU also in Silicon
A massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks.
- GQA
An attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention.
- Groq also in Silicon
An AI inference company with custom deterministic LPU chips and a hosted inference service that achieves extremely low time-per-token (1000+ tokens/sec on 70B models).
- hybrid attention
An attention design that interleaves different mechanisms across layers, typically global plus sliding-window, to combine quality with long-context efficiency.
- inference
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
- KV cache
The stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix.
- latency also in Compute
The time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric.
- llama.cpp
Georgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products.
- LM Studio
A desktop application for running open-weight models locally with a GUI, model browser, and OpenAI-compatible local server, targeting users who prefer apps over command-line tools.
- local-first also in Sovereignty and Decentralization Primitives
An architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation.
- memory bandwidth also in Silicon
The rate (GB/s or TB/s) at which an accelerator reads its memory. It sets the ceiling on decode tokens/sec, since each token streams the active weights once.
- MHA
Standard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow.
- Mixtral also in Weights
Mistral AI's MoE model line, with Mixtral 8x7B (the first widely-adopted open mixture-of-experts model) and the larger Mixtral 8x22B as its two flagship releases.
- mixture of experts also in Weights
A model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute.
- MLA
An attention variant introduced in DeepSeek-V2 that compresses keys and values through a learned low-rank projection, dramatically shrinking the KV cache.
- MLX
Apple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone.
- model bandwidth utilization
MBU is the fraction of an accelerator's peak memory bandwidth a serving stack actually reaches during decode. Real systems land around 60 to 85 percent.
- model FLOPs utilization
MFU is the fraction of an accelerator's peak compute a workload actually achieves. The compute-bound analogue of MBU, relevant to prefill and training, not memory-bound decode.
- MQA
An attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention.
- Multi-LoRA inference
Serving many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies.
- NF4 also in Weights
A 4-bit normal-float quantization format from the QLoRA paper. The 16 quantization levels are spaced to match the empirical distribution of pretrained weights.
- Ollama
A local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine.
- on-device also in Sovereignty and Decentralization Primitives
Running model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, FP8 quantization, and runtimes like llama.cpp and MLX.
- ONNX
An open interchange format for machine learning models, designed to let a model trained in one framework run in another via a portable graph representation.
- PagedAttention
An attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently.
- Petals also in Sovereignty and Decentralization Primitives
A volunteer-pooled inference system that runs large open-weight models across many internet-connected nodes, each holding a slice of the model, with users dispatching forward passes through the swarm.
- prefill
The first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens.
- prefix caching
A serving optimization that stores the KV cache for shared prompt prefixes (system prompts, few-shot examples) so subsequent requests reusing them skip the prefill compute.
- quantization also in Weights
Storing or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss.
- RadixAttention
A KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state.
- ROCm also in Silicon
AMD's open-source GPU compute stack, the main credible alternative to CUDA, with growing coverage in PyTorch and vLLM but still trailing on kernel maturity and tooling.
- roofline
A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.
- RoPE
A positional encoding that rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position, making attention scores depend on relative not absolute position.
- scheduler also in Compute
The component in a serving or training system that decides which work runs next, balancing throughput, fairness, latency targets, and resource constraints.
- SGLang
An open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads.
- sliding window attention
An attention pattern where each token attends only to a fixed window of recent tokens, trading global lookup for linear-cost inference at long sequence lengths.
- speculative decoding
An inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss.
- state space model also in Weights
An alternative to attention that processes sequences via a learned linear recurrence; scales linearly with sequence length where attention scales quadratically.
- tensor parallelism
A way to split a single model across multiple GPUs by sharding each layer's weight matrices and doing an all-reduce after every layer. Bandwidth-hungry but layer-by-layer fine-grained.
- TensorRT-LLM
NVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA.
- TGI
Hugging Face's production inference server, an early peer of vLLM that ceded throughput leadership in 2024 and now sits in maintenance mode behind vLLM and SGLang.
- throughput also in Compute
The rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics.
- tokenization also in Data
The process of mapping raw text into the integer-ID sequences a model consumes, governed by the model's specific tokenizer; the rate-limiting interface between text and tensor.
- tokenizer also in Data
The component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding.
- tokens per second
The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.
- TPOT
Time per output token. The latency between successive tokens during decode; tracks memory bandwidth and concurrent batch size more than peak compute.
- transformer
The neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM.
- TTFT
Time to first token. The latency from request received to the first output token streamed back; dominated by prompt-prefill cost and scheduler queueing.
- unified memory also in Silicon
A single physical memory pool shared by CPU and GPU, so the full capacity is usable as model memory; used by Apple Silicon, Strix Halo, and DGX Spark.
- verifiable inference also in Identity and Trust
An inference architecture that provides cryptographic proof the claimed model produced the claimed output, via TEE attestation, zero-knowledge proofs (ZKML), or proof-of-sample-correctness schemes.
- vLLM
An open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load.
- VRAM math also in Weights
The first-pass formula for whether a model fits on a GPU. VRAM ≈ parameters × (bits ÷ 8), plus 10-30 percent for KV cache, activations, and overhead.
- YaRN also in Weights
A position-encoding extension technique that lets a RoPE-pretrained model handle context windows longer than its training length without quality collapse.
- ZKML also in Identity and Trust
Zero-knowledge proofs of correct machine-learning inference, letting a prover convince a verifier that a specific model produced a specific output without revealing model or input.
Retrieval and Memory
11 terms
- agent memory
The persistent state an agent carries across turns and sessions, ranging from session-scoped scratchpads to long-term knowledge bases the agent reads and writes itself.
- BM25
A classical lexical ranking function for information retrieval, based on term frequency and inverse document frequency with saturation, still the strong lexical baseline for hybrid search.
- chunking
Splitting source documents into smaller passages for embedding and retrieval, where the chunk size and overlap directly affect retrieval quality and context efficiency.
- ColBERT
A retrieval model that produces per-token embeddings for documents and queries, then ranks by summing the maximum similarity across query tokens, more accurate than single-vector retrieval.
- embedding
A fixed-size vector representation of a piece of text learned so semantically similar texts land near each other in the vector space, the basis for vector search and most RAG.
- LangChain also in Agents
The earliest widely-adopted LLM agent and RAG orchestration framework (2022), now with the LangGraph extension for stateful multi-step agent workflows.
- LlamaIndex
An open-source RAG framework focused on connecting LLMs to external data, with strong document-ingestion tooling and a smaller surface area than LangChain.
- RAG
A pattern where a model retrieves relevant documents from an external store at query time and conditions its answer on them, instead of relying only on parametric knowledge.
- reranking
A second-pass scoring step that takes the top-k candidates from initial retrieval and rescores them with a more expensive but more accurate cross-encoder model.
- semantic search
Search that matches by meaning rather than literal terms, using embeddings to rank results by similarity to the query's intent rather than its surface tokens.
- vector database
A datastore optimized for approximate nearest-neighbor search over high-dimensional embedding vectors, the storage substrate for most RAG and recommendation pipelines.
Agents
21 terms
- A2A also in Protocols
A Google-launched open protocol for agent-to-agent communication, letting agents from different vendors discover each other's capabilities and exchange structured messages.
- agent memory also in Retrieval and Memory
The persistent state an agent carries across turns and sessions, ranging from session-scoped scratchpads to long-term knowledge bases the agent reads and writes itself.
- agentic
An informal descriptor for AI systems that pursue multi-step goals via tool use, planning, and self-correction, rather than single-turn question-answering.
- AutoGen
A Microsoft Research framework for multi-agent systems, with a conversation-pattern API for orchestrating multiple specialized agents to solve tasks collaboratively.
- chain of thought
A prompting and training technique where the model emits step-by-step intermediate reasoning before its final answer, improving accuracy on multi-step problems.
- function calling
A pattern where a model emits a structured call (function name plus arguments), the runtime executes it, and the result returns as input on the model's next turn.
- Goose
Block's open-source coding agent, BYOK across multiple model providers, with MCP support and a permissive license; the most cited fully-open agent platform in 2026.
- inference also in Runtime
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
- LangChain
The earliest widely-adopted LLM agent and RAG orchestration framework (2022), now with the LangGraph extension for stateful multi-step agent workflows.
- LlamaIndex also in Retrieval and Memory
An open-source RAG framework focused on connecting LLMs to external data, with strong document-ingestion tooling and a smaller surface area than LangChain.
- local-first also in Sovereignty and Decentralization Primitives
An architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation.
- MCP also in Protocols
An open protocol from Anthropic that standardizes how language models discover and call external tools, data sources, and prompts via a small JSON-RPC interface.
- multi-agent
Architectures where multiple LLM-driven agents collaborate or compete on a task, each with its own role, prompt, or specialization, coordinated by an orchestrator or message-passing protocol.
- NeMo Guardrails also in Safety and Guardrails
NVIDIA's open framework for programmable safety, topic, and conversation guardrails around LLM applications, using a Colang DSL to define allowed and disallowed conversation flows.
- prompt injection also in Safety and Guardrails
An attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt.
- RAG also in Retrieval and Memory
A pattern where a model retrieves relevant documents from an external store at query time and conditions its answer on them, instead of relying only on parametric knowledge.
- ReAct
An agent loop where the model alternates between reasoning steps (thought) and acting steps (tool call), explicitly interleaving free-form deliberation with structured tool use.
- SGLang also in Runtime
An open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads.
- tool use
The general pattern of an LLM invoking external functions, APIs, or systems to fetch data or take action, distinct from generating an answer purely from its weights.
- tree of thoughts
A prompting pattern that has the model generate and evaluate multiple branching reasoning paths, then select or backtrack rather than committing to a single chain of thought.
- vLLM also in Runtime
An open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load.
Protocols
9 terms
- A2A
A Google-launched open protocol for agent-to-agent communication, letting agents from different vendors discover each other's capabilities and exchange structured messages.
- agentic also in Agents
An informal descriptor for AI systems that pursue multi-step goals via tool use, planning, and self-correction, rather than single-turn question-answering.
- agentic payments
The class of payment flows initiated and settled by autonomous AI agents on a user's behalf, distinct from human-initiated checkout flows.
- decentralized GPU marketplace also in Infrastructure
A protocol market matching GPU supply from many independent providers to AI demand, settled on a token rail; Akash, io.net, Bittensor compute, and Hyperbolic are canonical.
- function calling also in Agents
A pattern where a model emits a structured call (function name plus arguments), the runtime executes it, and the result returns as input on the model's next turn.
- L402
A Lightning-Labs protocol that pairs HTTP 402 Payment Required with Lightning Network invoices, enabling sub-cent metered payments for APIs and content.
- MCP
An open protocol from Anthropic that standardizes how language models discover and call external tools, data sources, and prompts via a small JSON-RPC interface.
- tool use also in Agents
The general pattern of an LLM invoking external functions, APIs, or systems to fetch data or take action, distinct from generating an answer purely from its weights.
- x402
An open protocol revived by Coinbase in 2025 that uses the long-reserved HTTP 402 "Payment Required" status to let agents and APIs settle micropayments, including in stablecoins.
Evaluation
16 terms
- benchmark
A standardized dataset and scoring rubric used to compare model capability on a defined task, the unit of model evaluation since GLUE made the format the default.
- chain of thought also in Agents
A prompting and training technique where the model emits step-by-step intermediate reasoning before its final answer, improving accuracy on multi-step problems.
- frontier also in Weights
The current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold.
- hallucination also in Safety and Guardrails
A model output that is fluent and plausible-sounding but factually wrong, ranging from invented citations and APIs to fabricated names, dates, and quotes.
- Hugging Face also in Training
The model hub, dataset hub, and open-source library suite (Transformers, Datasets, Tokenizers, Accelerate, PEFT, TRL) that anchors the open-AI ecosystem's distribution and tooling layer.
- HumanEval
An OpenAI benchmark of 164 Python programming problems scored by whether unit tests pass, the default LLM-coding benchmark from 2021 until saturation in 2024.
- inference also in Runtime
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
- latency also in Compute
The time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric.
- leaderboard
A ranked listing of models scored on one benchmark or aggregate, with LMArena and SWE-Bench Verified as the main 2026 reference points and the Open LLM Leaderboard now archived.
- lm-eval-harness
EleutherAI's open-source evaluation framework that runs hundreds of standardized benchmarks against any Hugging Face or OpenAI-compatible model, the de facto reference harness behind the Open LLM Leaderboard.
- MMLU
A multiple-choice benchmark covering 57 academic and professional subjects, once the default capability score, now largely saturated by frontier models above 88% accuracy.
- perplexity
A measure of how well a language model predicts a text, equal to the exponential of the per-token cross-entropy loss; lower is better, often used for training diagnostics.
- throughput also in Compute
The rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics.
- tokens per second also in Runtime
The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.
- TPOT also in Runtime
Time per output token. The latency between successive tokens during decode; tracks memory bandwidth and concurrent batch size more than peak compute.
- TTFT also in Runtime
Time to first token. The latency from request received to the first output token streamed back; dominated by prompt-prefill cost and scheduler queueing.
Governance
19 terms
- acceptable-use
License or terms-of-service clauses that prohibit certain uses (weapons, surveillance, harassment, child sexual abuse material), common on open-weight licenses but rejected by the strict open-source definition.
- AGPL
A strong-copyleft license that extends GPL's source-distribution requirement to network-served software, the strongest open-source license to deter proprietary SaaS deployment.
- alignment also in Training
The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.
- Apache 2.0
A permissive open-source license used by most open-weight model releases (Llama from 4 onward partial, Qwen, Mistral, DeepSeek, Falcon), allowing commercial use without acceptable-use restrictions.
- field-of-use
License clauses that limit which industries or applications a model may be deployed in, restricting use to non-competitive, non-commercial, or non-government purposes.
- frontier also in Weights
The current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold.
- Gemma also in Weights
Google's open-weight model family derived from Gemini research, with source-available licensing that includes an acceptable-use clause and license-revocation hook.
- grid interconnect queue also in Infrastructure
The regulatory queue a new generation or load project must traverse to connect to the grid; currently the binding constraint on how fast gigawatt-class AI sites come online.
- hyperscaler capex also in Infrastructure
The capital expenditure that Microsoft, Google, Amazon, Meta, and Oracle are spending on AI infrastructure, totaling roughly $300B+ annually by 2025-2026 and dominating the supply-and-demand signal for the entire stack.
- Llama also in Weights
Meta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause.
- MAU
A user-count metric used in restrictive open-weights licenses (notably Llama's Community License) to trigger a requirement to negotiate a separate commercial license at scale.
- ONNX also in Runtime
An open interchange format for machine learning models, designed to let a model trained in one framework run in another via a portable graph representation.
- open weights also in Weights
A model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness.
- OSAID
The OSI's October 2024 definition of "open source AI," requiring not just weights but enough information about data, code, and architecture for third parties to reproduce the system.
- OSI
The nonprofit that maintains the canonical Open Source Definition for software since 1998, and the OSAID definition for AI as of 2024.
- RISC-V also in Silicon
An open instruction set architecture, royalty-free and modular, increasingly used in AI accelerator cores (Tenstorrent, SiFive Intelligence) as the open alternative to ARM and x86.
- RLHF also in Training
A post-training pipeline that uses human preference rankings to train a reward model, then optimizes a base model against that reward via reinforcement learning.
- source-available also in Weights
A license category that lets users read and modify the code or weights but imposes restrictions (use limits, non-compete, MAU thresholds) that exclude it from the strict open-source definition.
- sovereign compute also in Infrastructure
AI compute capacity owned, operated, or contractually controlled by a nation-state for the use of its own institutions and citizens, distinct from rented capacity on US-hyperscaler clouds.
Identity and Trust
6 terms
- attestation
A cryptographic protocol that lets a remote party verify which code is running inside a TEE, including which model is loaded and which build of the inference engine.
- confidential computing
The umbrella category of compute architectures where workloads run isolated from the host operator, combining hardware TEEs, attestation, and encrypted-memory protections.
- SGX
Intel's earliest mainstream trusted execution environment, the predecessor to TDX, with smaller enclave sizes and a history of side-channel vulnerabilities that limited its deployment for AI.
- TEE
A hardware-isolated CPU region where code and data are protected from inspection by the host OS, used to run inference in a way the operator cannot read or modify.
- verifiable inference
An inference architecture that provides cryptographic proof the claimed model produced the claimed output, via TEE attestation, zero-knowledge proofs (ZKML), or proof-of-sample-correctness schemes.
- ZKML
Zero-knowledge proofs of correct machine-learning inference, letting a prover convince a verifier that a specific model produced a specific output without revealing model or input.
Safety and Guardrails
11 terms
- acceptable-use also in Governance
License or terms-of-service clauses that prohibit certain uses (weapons, surveillance, harassment, child sexual abuse material), common on open-weight licenses but rejected by the strict open-source definition.
- alignment also in Training
The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.
- constitutional AI
Anthropic's alignment technique where a model is trained to critique and revise its own outputs against a written list of principles (the "constitution"), reducing the need for human ranking labels.
- DPO also in Training
A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.
- hallucination
A model output that is fluent and plausible-sounding but factually wrong, ranging from invented citations and APIs to fabricated names, dates, and quotes.
- jailbreak
An adversarial input that causes a model to bypass its training-time refusal policy and produce content it would normally refuse, distinct from prompt injection's instruction-hijack.
- Llama Guard
Meta's open content-moderation model line, designed to classify prompts and responses against a configurable taxonomy of harms, deployable as an input/output filter.
- NeMo Guardrails
NVIDIA's open framework for programmable safety, topic, and conversation guardrails around LLM applications, using a Colang DSL to define allowed and disallowed conversation flows.
- post-training also in Training
Everything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant.
- prompt injection
An attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt.
- RLHF also in Training
A post-training pipeline that uses human preference rankings to train a reward model, then optimizes a base model against that reward via reinforcement learning.
Sovereignty and Decentralization Primitives
17 terms
- agentic payments also in Protocols
The class of payment flows initiated and settled by autonomous AI agents on a user's behalf, distinct from human-initiated checkout flows.
- decentralized GPU marketplace also in Infrastructure
A protocol market matching GPU supply from many independent providers to AI demand, settled on a token rail; Akash, io.net, Bittensor compute, and Hyperbolic are canonical.
- decentralized training
Training a model across many independently-operated nodes that are not tightly coupled, contrasted with single-cluster training; the architecture for community-owned model production.
- L402 also in Protocols
A Lightning-Labs protocol that pairs HTTP 402 Payment Required with Lightning Network invoices, enabling sub-cent metered payments for APIs and content.
- llama.cpp also in Runtime
Georgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products.
- LM Studio also in Runtime
A desktop application for running open-weight models locally with a GUI, model browser, and OpenAI-compatible local server, targeting users who prefer apps over command-line tools.
- local-first
An architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation.
- MLX also in Runtime
Apple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone.
- nuclear PPA also in Infrastructure
A long-term contract under which an AI operator commits to buy a defined nuclear-generated power output, becoming the cornerstone financing mechanism for gigawatt-class AI buildouts.
- Ollama also in Runtime
A local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine.
- on-device
Running model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, FP8 quantization, and runtimes like llama.cpp and MLX.
- Petals
A volunteer-pooled inference system that runs large open-weight models across many internet-connected nodes, each holding a slice of the model, with users dispatching forward passes through the swarm.
- RISC-V also in Silicon
An open instruction set architecture, royalty-free and modular, increasingly used in AI accelerator cores (Tenstorrent, SiFive Intelligence) as the open alternative to ARM and x86.
- sovereign compute also in Infrastructure
AI compute capacity owned, operated, or contractually controlled by a nation-state for the use of its own institutions and citizens, distinct from rented capacity on US-hyperscaler clouds.
- Tenstorrent also in Silicon
An AI accelerator startup designing RISC-V-based chips (Wormhole, Blackhole, Grendel) with an open software stack, positioned as the leading open alternative to NVIDIA at the silicon layer.
- x402 also in Protocols
An open protocol revived by Coinbase in 2025 that uses the long-reserved HTTP 402 "Payment Required" status to let agents and APIs settle micropayments, including in stablecoins.
- ZKML also in Identity and Trust
Zero-knowledge proofs of correct machine-learning inference, letting a prover convince a verifier that a specific model produced a specific output without revealing model or input.
- A2A Protocols
A Google-launched open protocol for agent-to-agent communication, letting agents from different vendors discover each other's capabilities and exchange structured messages.
- acceptable-use Governance
License or terms-of-service clauses that prohibit certain uses (weapons, surveillance, harassment, child sexual abuse material), common on open-weight licenses but rejected by the strict open-source definition.
- agent memory Retrieval and Memory
The persistent state an agent carries across turns and sessions, ranging from session-scoped scratchpads to long-term knowledge bases the agent reads and writes itself.
- agentic Agents
An informal descriptor for AI systems that pursue multi-step goals via tool use, planning, and self-correction, rather than single-turn question-answering.
- agentic payments Protocols
The class of payment flows initiated and settled by autonomous AI agents on a user's behalf, distinct from human-initiated checkout flows.
- AGPL Governance
A strong-copyleft license that extends GPL's source-distribution requirement to network-served software, the strongest open-source license to deter proprietary SaaS deployment.
- AI factory Infrastructure
A purpose-built data center optimized for AI training rather than general cloud workloads, characterized by liquid-cooled high-density GPU racks, gigawatt-scale single-tenant power, and tightly-coupled networking.
- ALiBi Runtime
A positional encoding that adds a linear bias to attention scores based on the distance between tokens, with no learned position parameters and natural length extrapolation.
- alignment Training
The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.
- Apache 2.0 Governance
A permissive open-source license used by most open-weight model releases (Llama from 4 onward partial, Qwen, Mistral, DeepSeek, Falcon), allowing commercial use without acceptable-use restrictions.
- arithmetic intensity Runtime
FLOPs performed per byte read from memory. Low intensity means an operation is memory-bound; high intensity means compute-bound. LLM decode has very low intensity.
- attention Runtime
The transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors.
- attestation Identity and Trust
A cryptographic protocol that lets a remote party verify which code is running inside a TEE, including which model is loaded and which build of the inference engine.
- AutoGen Agents
A Microsoft Research framework for multi-agent systems, with a conversation-pattern API for orchestrating multiple specialized agents to solve tasks collaboratively.
- AWQ Weights
A post-training quantization method that protects the small fraction of weight channels that handle the largest activations, achieving 4-bit weights with little quality loss.
- Axolotl Training
An open YAML-driven fine-tuning framework that orchestrates Hugging Face Transformers, PEFT, TRL, and DeepSpeed for one-shot LoRA, QLoRA, and full fine-tuning workflows.
- batching Compute
Grouping multiple requests or training examples into a single forward or backward pass, the lever that turns GPU compute density into throughput.
- behind-the-meter Infrastructure
A power arrangement where generation sits on the same side of the utility meter as the load, letting a data center draw directly from the plant and bypass the grid.
- benchmark Evaluation
A standardized dataset and scoring rubric used to compare model capability on a defined task, the unit of model evaluation since GLUE made the format the default.
- BF16 Silicon
A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.
- BM25 Retrieval and Memory
A classical lexical ranking function for information retrieval, based on term frequency and inverse document frequency with saturation, still the strong lexical baseline for hybrid search.
- BPE Data
A subword tokenization algorithm that iteratively merges the most-frequent byte pairs in a corpus, producing a vocabulary that balances common-word coverage with arbitrary-text fallback.
- Cerebras Silicon
An AI compute company built around wafer-scale chips (the WSE-3 is a single die covering most of a 300mm wafer), offering some of the lowest inference latency on the market.
- chain of thought Agents
A prompting and training technique where the model emits step-by-step intermediate reasoning before its final answer, improving accuracy on multi-step problems.
- chunking Retrieval and Memory
Splitting source documents into smaller passages for embedding and retrieval, where the chunk size and overlap directly affect retrieval quality and context efficiency.
- ColBERT Retrieval and Memory
A retrieval model that produces per-token embeddings for documents and queries, then ranks by summing the maximum similarity across query tokens, more accurate than single-vector retrieval.
- Common Crawl Data
A nonprofit-run repeated crawl of the public web maintained since 2007, the upstream raw source for nearly every open web-scale pretraining corpus.
- confidential computing Identity and Trust
The umbrella category of compute architectures where workloads run isolated from the host operator, combining hardware TEEs, attestation, and encrypted-memory protections.
- constitutional AI Safety and Guardrails
Anthropic's alignment technique where a model is trained to critique and revise its own outputs against a written list of principles (the "constitution"), reducing the need for human ranking labels.
- context window Runtime
The maximum number of tokens a model can attend to in a single forward pass, set during pretraining and extended (sometimes) via fine-tuning or training-free extrapolation tricks.
- continuous batching Runtime
A request-scheduling pattern where the inference engine adds new requests to the running batch as soon as one finishes a token, instead of waiting for the whole batch to complete.
- CUDA Silicon
NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.
- decentralized GPU marketplace Infrastructure
A protocol market matching GPU supply from many independent providers to AI demand, settled on a token rail; Akash, io.net, Bittensor compute, and Hyperbolic are canonical.
- decentralized training Sovereignty and Decentralization Primitives
Training a model across many independently-operated nodes that are not tightly coupled, contrasted with single-cluster training; the architecture for community-owned model production.
- decode Runtime
The second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute.
- DeepSeek Weights
A Chinese open-weight family known for the V3 MoE base model and the R1 reasoning model, both released under permissive licenses and unusually transparent in their training-cost reporting.
- DeepSpeed Training
Microsoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale.
- dense Weights
A transformer where every parameter activates on every token; the conventional architecture before mixture of experts became common at frontier scale.
- direct-to-chip cooling Infrastructure
A cooling architecture that pipes liquid coolant directly to a cold plate on each processor, evacuating the 700+ watts per GPU that air cooling cannot handle.
- DPO Training
A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.
- embedding Retrieval and Memory
A fixed-size vector representation of a piece of text learned so semantically similar texts land near each other in the vector space, the basis for vector search and most RAG.
- expert parallelism Runtime
A parallelism strategy for mixture-of-experts models where different GPUs hold different experts; requires all-to-all communication on every token routing step.
- field-of-use Governance
License clauses that limit which industries or applications a model may be deployed in, restricting use to non-competitive, non-commercial, or non-government purposes.
- fine-tuning Training
Continued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch.
- FineWeb Data
An open large-scale web text dataset from Hugging Face, the highest-quality permissively-licensed pretraining corpus by 2024 to 2026 with ~15 trillion tokens after deduplication and filtering.
- FlashAttention Runtime
An exact attention algorithm that reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving 2 to 4 times speedup with no quality loss.
- FP16 Silicon
A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.
- FP4 Silicon
A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.
- FP8 Silicon
An 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads.
- frontier Weights
The current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold.
- function calling Agents
A pattern where a model emits a structured call (function name plus arguments), the runtime executes it, and the result returns as input on the model's next turn.
- GDDR7 Silicon
The graphics memory generation on 2025-era consumer and workstation GPUs such as the RTX 5090 and RTX PRO 6000. High bandwidth per board, lower capacity than HBM.
- Gemma Weights
Google's open-weight model family derived from Gemini research, with source-available licensing that includes an acceptable-use clause and license-revocation hook.
- GGUF Weights
A binary container format for quantized model weights used by llama.cpp and its ecosystem; the dominant on-device LLM file format since 2023.
- gigawatt-class cluster Infrastructure
An AI training facility whose power draw is measured in gigawatts rather than megawatts, the scale at which siting decisions become grid-and-permitting problems rather than real-estate ones.
- Goose Agents
Block's open-source coding agent, BYOK across multiple model providers, with MCP support and a permissive license; the most cited fully-open agent platform in 2026.
- GPTQ Weights
A post-training quantization method that compresses transformer weights to 3 or 4 bits layer-by-layer with one-shot optimization against calibration data.
- GPU Silicon
A massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks.
- GQA Runtime
An attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention.
- grid interconnect queue Infrastructure
The regulatory queue a new generation or load project must traverse to connect to the grid; currently the binding constraint on how fast gigawatt-class AI sites come online.
- Groq Silicon
An AI inference company with custom deterministic LPU chips and a hosted inference service that achieves extremely low time-per-token (1000+ tokens/sec on 70B models).
- hallucination Safety and Guardrails
A model output that is fluent and plausible-sounding but factually wrong, ranging from invented citations and APIs to fabricated names, dates, and quotes.
- HBM Silicon
Stacked DRAM used as the main memory of every modern AI accelerator, with bandwidth in TB/s rather than GB/s and capacity per stack in tens of GB.
- Hugging Face Training
The model hub, dataset hub, and open-source library suite (Transformers, Datasets, Tokenizers, Accelerate, PEFT, TRL) that anchors the open-AI ecosystem's distribution and tooling layer.
- HumanEval Evaluation
An OpenAI benchmark of 164 Python programming problems scored by whether unit tests pass, the default LLM-coding benchmark from 2021 until saturation in 2024.
- hybrid attention Runtime
An attention design that interleaves different mechanisms across layers, typically global plus sliding-window, to combine quality with long-context efficiency.
- hyperscaler capex Infrastructure
The capital expenditure that Microsoft, Google, Amazon, Meta, and Oracle are spending on AI infrastructure, totaling roughly $300B+ annually by 2025-2026 and dominating the supply-and-demand signal for the entire stack.
- inference Runtime
Running a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training.
- InfiniBand Compute
A high-throughput, low-latency network fabric (Mellanox, now NVIDIA) used for inter-node communication in AI training clusters, supporting RDMA for direct GPU-to-GPU transfer across machines.
- jailbreak Safety and Guardrails
An adversarial input that causes a model to bypass its training-time refusal policy and produce content it would normally refuse, distinct from prompt injection's instruction-hijack.
- knowledge distillation Training
A training technique where a small student model learns to mimic a larger teacher model's output distributions, transferring capability into a cheaper-to-serve form.
- KV cache Runtime
The stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix.
- L402 Protocols
A Lightning-Labs protocol that pairs HTTP 402 Payment Required with Lightning Network invoices, enabling sub-cent metered payments for APIs and content.
- LangChain Agents
The earliest widely-adopted LLM agent and RAG orchestration framework (2022), now with the LangGraph extension for stateful multi-step agent workflows.
- latency Compute
The time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric.
- leaderboard Evaluation
A ranked listing of models scored on one benchmark or aggregate, with LMArena and SWE-Bench Verified as the main 2026 reference points and the Open LLM Leaderboard now archived.
- Llama Weights
Meta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause.
- Llama Guard Safety and Guardrails
Meta's open content-moderation model line, designed to classify prompts and responses against a configurable taxonomy of harms, deployable as an input/output filter.
- llama.cpp Runtime
Georgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products.
- LlamaIndex Retrieval and Memory
An open-source RAG framework focused on connecting LLMs to external data, with strong document-ingestion tooling and a smaller surface area than LangChain.
- LM Studio Runtime
A desktop application for running open-weight models locally with a GUI, model browser, and OpenAI-compatible local server, targeting users who prefer apps over command-line tools.
- lm-eval-harness Evaluation
EleutherAI's open-source evaluation framework that runs hundreds of standardized benchmarks against any Hugging Face or OpenAI-compatible model, the de facto reference harness behind the Open LLM Leaderboard.
- local-first Sovereignty and Decentralization Primitives
An architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation.
- LoRA Training
A parameter-efficient fine-tuning method that injects small low-rank adapter matrices into a frozen base model, training a tiny fraction of weights instead of the full model.
- LPDDR5X Silicon
Low-power DRAM used as unified memory in Apple Silicon, DGX Spark, and Strix Halo. High capacity and efficiency, with bandwidth below HBM and GDDR.
- MAU Governance
A user-count metric used in restrictive open-weights licenses (notably Llama's Community License) to trigger a requirement to negotiate a separate commercial license at scale.
- MCP Protocols
An open protocol from Anthropic that standardizes how language models discover and call external tools, data sources, and prompts via a small JSON-RPC interface.
- Megatron Training
NVIDIA's distributed-training framework for large transformer models, providing the reference implementation of tensor parallelism, pipeline parallelism, and 3D parallelism used in many open and closed training runs.
- memory bandwidth Silicon
The rate (GB/s or TB/s) at which an accelerator reads its memory. It sets the ceiling on decode tokens/sec, since each token streams the active weights once.
- MHA Runtime
Standard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow.
- Mistral Weights
A French open-weight model family from Mistral AI, released mostly under Apache 2.0 with strong performance per parameter and notable MoE variants (Mixtral, Mixtral 8x22B).
- Mixtral Weights
Mistral AI's MoE model line, with Mixtral 8x7B (the first widely-adopted open mixture-of-experts model) and the larger Mixtral 8x22B as its two flagship releases.
- mixture of experts Weights
A model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute.
- MLA Runtime
An attention variant introduced in DeepSeek-V2 that compresses keys and values through a learned low-rank projection, dramatically shrinking the KV cache.
- MLX Runtime
Apple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone.
- MMLU Evaluation
A multiple-choice benchmark covering 57 academic and professional subjects, once the default capability score, now largely saturated by frontier models above 88% accuracy.
- model bandwidth utilization Runtime
MBU is the fraction of an accelerator's peak memory bandwidth a serving stack actually reaches during decode. Real systems land around 60 to 85 percent.
- model FLOPs utilization Runtime
MFU is the fraction of an accelerator's peak compute a workload actually achieves. The compute-bound analogue of MBU, relevant to prefill and training, not memory-bound decode.
- MQA Runtime
An attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention.
- multi-agent Agents
Architectures where multiple LLM-driven agents collaborate or compete on a task, each with its own role, prompt, or specialization, coordinated by an orchestrator or message-passing protocol.
- Multi-LoRA inference Runtime
Serving many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies.
- NeMo Guardrails Safety and Guardrails
NVIDIA's open framework for programmable safety, topic, and conversation guardrails around LLM applications, using a Colang DSL to define allowed and disallowed conversation flows.
- neocloud Infrastructure
A specialized cloud provider focused exclusively on GPU and AI workloads, operating outside the traditional AWS/Azure/GCP hyperscaler perimeter, with CoreWeave, Lambda, and Voltage Park as the canonical examples.
- NF4 Weights
A 4-bit normal-float quantization format from the QLoRA paper. The 16 quantization levels are spaced to match the empirical distribution of pretrained weights.
- nuclear PPA Infrastructure
A long-term contract under which an AI operator commits to buy a defined nuclear-generated power output, becoming the cornerstone financing mechanism for gigawatt-class AI buildouts.
- NVLink Compute
NVIDIA's proprietary GPU-to-GPU interconnect, providing bandwidth an order of magnitude above PCIe and the basis for tightly-coupled 8-GPU server nodes (DGX, HGX).
- Ollama Runtime
A local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine.
- on-device Sovereignty and Decentralization Primitives
Running model inference on the user's local hardware (phone, laptop, embedded device), enabled by smaller models, FP8 quantization, and runtimes like llama.cpp and MLX.
- ONNX Runtime
An open interchange format for machine learning models, designed to let a model trained in one framework run in another via a portable graph representation.
- open weights Weights
A model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness.
- OSAID Governance
The OSI's October 2024 definition of "open source AI," requiring not just weights but enough information about data, code, and architecture for third parties to reproduce the system.
- OSI Governance
The nonprofit that maintains the canonical Open Source Definition for software since 1998, and the OSAID definition for AI as of 2024.
- PagedAttention Runtime
An attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently.
- PEFT Training
A family of fine-tuning methods that update only a small fraction of a base model's parameters, making fine-tuning feasible on consumer hardware and storage-efficient at deployment.
- perplexity Evaluation
A measure of how well a language model predicts a text, equal to the exponential of the per-token cross-entropy loss; lower is better, often used for training diagnostics.
- Petals Sovereignty and Decentralization Primitives
A volunteer-pooled inference system that runs large open-weight models across many internet-connected nodes, each holding a slice of the model, with users dispatching forward passes through the swarm.
- post-training Training
Everything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant.
- prefill Runtime
The first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens.
- prefix caching Runtime
A serving optimization that stores the KV cache for shared prompt prefixes (system prompts, few-shot examples) so subsequent requests reusing them skip the prefill compute.
- pretraining Training
The first and most compute-expensive training phase, where a base model learns general capabilities by predicting the next token on trillions of words of web and book data.
- prompt injection Safety and Guardrails
An attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt.
- PUE Infrastructure
Power Usage Effectiveness, the ratio of total data center facility power to power delivered to IT equipment; lower is better, with 1.0 the floor and 1.1 a strong target.
- QLoRA Training
A fine-tuning method that combines 4-bit quantization of the frozen base model with LoRA adapters, making large-model fine-tuning fit on a single consumer GPU.
- quantization Weights
Storing or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss.
- Qwen Weights
Alibaba's open-weight model family, leading the multilingual and Chinese-language open-weight space, released under Apache 2.0 with sizes from 0.6B to 235B parameters.
- RadixAttention Runtime
A KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state.
- RAG Retrieval and Memory
A pattern where a model retrieves relevant documents from an external store at query time and conditions its answer on them, instead of relying only on parametric knowledge.
- RDMA Compute
A networking technique that lets a remote machine read or write local memory without involving the CPU, foundational for high-throughput distributed training over InfiniBand or RoCE.
- ReAct Agents
An agent loop where the model alternates between reasoning steps (thought) and acting steps (tool call), explicitly interleaving free-form deliberation with structured tool use.
- RedPajama Data
An early open reproduction of the Llama 1 pretraining corpus from Together AI (2023), now superseded by FineWeb and Dolma but historically important as the first open frontier-scale dataset.
- reranking Retrieval and Memory
A second-pass scoring step that takes the top-k candidates from initial retrieval and rescores them with a more expensive but more accurate cross-encoder model.
- RISC-V Silicon
An open instruction set architecture, royalty-free and modular, increasingly used in AI accelerator cores (Tenstorrent, SiFive Intelligence) as the open alternative to ARM and x86.
- RLHF Training
A post-training pipeline that uses human preference rankings to train a reward model, then optimizes a base model against that reward via reinforcement learning.
- ROCm Silicon
AMD's open-source GPU compute stack, the main credible alternative to CUDA, with growing coverage in PyTorch and vLLM but still trailing on kernel maturity and tooling.
- roofline Runtime
A performance model that bounds throughput by either compute or memory bandwidth, whichever is the limiting resource for an operation's arithmetic intensity.
- RoPE Runtime
A positional encoding that rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position, making attention scores depend on relative not absolute position.
- scheduler Compute
The component in a serving or training system that decides which work runs next, balancing throughput, fairness, latency targets, and resource constraints.
- semantic search Retrieval and Memory
Search that matches by meaning rather than literal terms, using embeddings to rank results by similarity to the query's intent rather than its surface tokens.
- SGLang Runtime
An open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads.
- SGX Identity and Trust
Intel's earliest mainstream trusted execution environment, the predecessor to TDX, with smaller enclave sizes and a history of side-channel vulnerabilities that limited its deployment for AI.
- sharding Training
A distributed training pattern where parameters, gradients, and optimizer states are split across GPUs (and sometimes hosts) so the total memory footprint scales with the cluster, not with each GPU.
- sliding window attention Runtime
An attention pattern where each token attends only to a fixed window of recent tokens, trading global lookup for linear-cost inference at long sequence lengths.
- SMR Infrastructure
A nuclear reactor under ~300 MW per unit, factory-fabricated rather than site-built, positioned as the firm-power option for AI data centers needing new generation faster than conventional plants deliver.
- source-available Weights
A license category that lets users read and modify the code or weights but imposes restrictions (use limits, non-compete, MAU thresholds) that exclude it from the strict open-source definition.
- sovereign compute Infrastructure
AI compute capacity owned, operated, or contractually controlled by a nation-state for the use of its own institutions and citizens, distinct from rented capacity on US-hyperscaler clouds.
- speculative decoding Runtime
An inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss.
- spot instance Compute
A discounted cloud instance that the provider can reclaim with little warning, used for fault-tolerant training and batch inference where interruption is cheaper than reservation cost.
- state space model Weights
An alternative to attention that processes sequences via a learned linear recurrence; scales linearly with sequence length where attention scales quadratically.
- TEE Identity and Trust
A hardware-isolated CPU region where code and data are protected from inspection by the host OS, used to run inference in a way the operator cannot read or modify.
- tensor parallelism Runtime
A way to split a single model across multiple GPUs by sharding each layer's weight matrices and doing an all-reduce after every layer. Bandwidth-hungry but layer-by-layer fine-grained.
- TensorRT-LLM Runtime
NVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA.
- Tenstorrent Silicon
An AI accelerator startup designing RISC-V-based chips (Wormhole, Blackhole, Grendel) with an open software stack, positioned as the leading open alternative to NVIDIA at the silicon layer.
- TGI Runtime
Hugging Face's production inference server, an early peer of vLLM that ceded throughput leadership in 2024 and now sits in maintenance mode behind vLLM and SGLang.
- The Pile Data
An 825 GB diverse-source pretraining dataset assembled by EleutherAI in 2020, the open-corpus precedent that the later RedPajama and FineWeb projects expanded on.
- throughput Compute
The rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics.
- tokenization Data
The process of mapping raw text into the integer-ID sequences a model consumes, governed by the model's specific tokenizer; the rate-limiting interface between text and tensor.
- tokenizer Data
The component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding.
- tokens per second Runtime
The headline inference speed metric. Decode tokens/sec is what a user feels as text streams; it is bounded by memory bandwidth divided by the bytes streamed per token.
- tool use Agents
The general pattern of an LLM invoking external functions, APIs, or systems to fetch data or take action, distinct from generating an answer purely from its weights.
- TPOT Runtime
Time per output token. The latency between successive tokens during decode; tracks memory bandwidth and concurrent batch size more than peak compute.
- TPU Silicon
Google's custom AI accelerator family, used internally for training Gemini and externally via Google Cloud, designed around dense matrix multiplication with a systolic array architecture.
- transformer Runtime
The neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM.
- tree of thoughts Agents
A prompting pattern that has the model generate and evaluate multiple branching reasoning paths, then select or backtrack rather than committing to a single chain of thought.
- TRL Training
Hugging Face's library for preference and reinforcement learning on transformer models, the canonical open implementation of RLHF, DPO, KTO, ORPO, and related preference-tuning methods.
- TTFT Runtime
Time to first token. The latency from request received to the first output token streamed back; dominated by prompt-prefill cost and scheduler queueing.
- unified memory Silicon
A single physical memory pool shared by CPU and GPU, so the full capacity is usable as model memory; used by Apple Silicon, Strix Halo, and DGX Spark.
- Unsloth Training
An open fine-tuning library that uses hand-written Triton kernels and a manual gradient implementation to run LoRA and QLoRA fine-tuning roughly 2x faster than the Hugging Face baseline.
- vector database Retrieval and Memory
A datastore optimized for approximate nearest-neighbor search over high-dimensional embedding vectors, the storage substrate for most RAG and recommendation pipelines.
- verifiable inference Identity and Trust
An inference architecture that provides cryptographic proof the claimed model produced the claimed output, via TEE attestation, zero-knowledge proofs (ZKML), or proof-of-sample-correctness schemes.
- vLLM Runtime
An open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load.
- VRAM math Weights
The first-pass formula for whether a model fits on a GPU. VRAM ≈ parameters × (bits ÷ 8), plus 10-30 percent for KV cache, activations, and overhead.
- x402 Protocols
An open protocol revived by Coinbase in 2025 that uses the long-reserved HTTP 402 "Payment Required" status to let agents and APIs settle micropayments, including in stablecoins.
- YaRN Weights
A position-encoding extension technique that lets a RoPE-pretrained model handle context windows longer than its training length without quality collapse.
- ZKML Identity and Trust
Zero-knowledge proofs of correct machine-learning inference, letting a prover convince a verifier that a specific model produced a specific output without revealing model or input.