Glossary
Petals
A volunteer-pooled inference system that runs large open-weight models across many internet-connected nodes, each holding a slice of the model, with users dispatching forward passes through the swarm.
The canonical decentralized-inference reference project. Petals splits a transformer into stages, runs each stage on a different volunteer GPU connected over the public internet, and forwards activations between them. A user connects to the network as a client, which routes the forward pass through a sequence of nodes serving each stage.
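The routing idea can be illustrated with a toy sketch. This is not the Petals API: `Node`, `plan_route`, and `forward` are hypothetical names, and the greedy route selection is an assumption made for illustration, not a description of how Petals actually picks servers. It only shows a client chaining nodes so that every layer is covered once, in order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical volunteer node serving layers [first_layer, last_layer)."""
    name: str
    first_layer: int
    last_layer: int


def plan_route(nodes: List[Node], num_layers: int) -> List[Tuple[Node, int, int]]:
    """Greedily pick a chain of nodes that covers layers [0, num_layers)."""
    route, covered = [], 0
    while covered < num_layers:
        # Nodes that can continue from the first layer not yet covered.
        candidates = [n for n in nodes if n.first_layer <= covered < n.last_layer]
        if not candidates:
            raise RuntimeError(f"no node serves layer {covered}")
        # Assumed heuristic: prefer the node that reaches furthest.
        best = max(candidates, key=lambda n: n.last_layer)
        end = min(best.last_layer, num_layers)
        route.append((best, covered, end))
        covered = end
    return route


def forward(route, activations, layer_fn: Callable):
    """Pass activations through each stage in order, as a client would."""
    for _node, start, end in route:
        for layer in range(start, end):
            activations = layer_fn(layer, activations)
    return activations


# Illustrative swarm: overlapping slices of a 12-layer model.
swarm = [Node("a", 0, 4), Node("b", 2, 6), Node("c", 4, 10), Node("d", 10, 12)]
route = plan_route(swarm, num_layers=12)
print([node.name for node, _, _ in route])
```

In a real swarm each stage boundary is a network hop, so the client would also need to handle nodes that leave mid-request; the sketch leaves that out.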
Throughput is dramatically lower than in centralized serving, because inter-node latency over consumer internet connections limits how fast activations can move between stages. Even so, the architecture demonstrates that frontier-scale models can run without a single operator controlling the inference. Fine-tuning over the swarm is also supported.
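A back-of-the-envelope model shows why latency dominates. It is a simplification with assumed, illustrative numbers rather than measurements of Petals: one request decodes one token at a time, and each token must cross every stage in sequence. `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(num_hops: int, hop_latency_s: float, compute_per_stage_s: float) -> float:
    """Single-stream decode rate when each token traverses all stages in order."""
    step_time = num_hops * (hop_latency_s + compute_per_stage_s)
    return 1.0 / step_time


# Assumed figures for illustration only.
swarm_rate = tokens_per_second(num_hops=4, hop_latency_s=0.100, compute_per_stage_s=0.025)
datacenter_rate = tokens_per_second(num_hops=4, hop_latency_s=0.0001, compute_per_stage_s=0.025)
print(f"swarm: {swarm_rate:.1f} tok/s, datacenter: {datacenter_rate:.1f} tok/s")
```

Under these assumptions, network time is about four fifths of each decoding step in the swarm case, so faster GPUs alone would not close the gap.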
Petals is less a production-ready alternative to vLLM and more a proof-of-concept for sovereignty-aligned inference architecture. Its core idea, pipeline-parallel inference across loosely connected nodes, informs a broader research stream of decentralized AI compute.