The Open-Source AI Stack

Glossary

scheduler

The component in a serving or training system that decides which work runs next, balancing throughput, fairness, latency targets, and resource constraints.

Category: Compute (also: Runtime, Training). Also known as: request scheduler, training scheduler.

In an inference runtime: the loop that picks which requests to include in the next batch, when to admit a new request, when to evict cache blocks. vLLM and SGLang both use continuous-batching schedulers that make these decisions at the token level. The scheduler’s policy determines the throughput-vs-latency curve of the whole engine.
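The admit-and-batch loop described above can be sketched in a few lines. This is a toy illustration, not the scheduler of vLLM or SGLang: the names `Request`, `ToyScheduler`, `max_batch`, and `kv_budget` are hypothetical, and the sketch sidesteps eviction by reserving each request's worst-case cache footprint at admission, where a real engine would typically admit more aggressively and evict or preempt under pressure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens

    @property
    def worst_case_tokens(self) -> int:
        # Upper bound on cache tokens this request can ever hold.
        return self.prompt_tokens + self.max_new_tokens


class ToyScheduler:
    """Hypothetical token-level scheduler: re-forms the batch on every step."""

    def __init__(self, max_batch: int, kv_budget: int):
        self.max_batch = max_batch    # assumed cap on concurrent sequences
        self.kv_budget = kv_budget    # assumed cache capacity, in tokens
        self.waiting = deque()
        self.running = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        # 1. Retire finished requests, freeing their slots immediately.
        self.running = [r for r in self.running if not r.done]
        # 2. Admit waiting requests first-come-first-served while they fit.
        #    A head-of-line request that never fits blocks the queue here.
        while self.waiting and len(self.running) < self.max_batch:
            reserved = sum(r.worst_case_tokens for r in self.running)
            if reserved + self.waiting[0].worst_case_tokens > self.kv_budget:
                break
            self.running.append(self.waiting.popleft())
        # 3. One decode step: every running request produces one token.
        for r in self.running:
            r.generated += 1
        return [r.rid for r in self.running]


sched = ToyScheduler(max_batch=2, kv_budget=100)
sched.submit(Request("a", prompt_tokens=10, max_new_tokens=2))
sched.submit(Request("b", prompt_tokens=10, max_new_tokens=4))
sched.submit(Request("c", prompt_tokens=10, max_new_tokens=1))
batches = [sched.step() for _ in range(5)]
print(batches)
```

Request `c` joins the batch on the step after `a` finishes rather than waiting for `b` to finish too, which is the point of batching at the token level. Raising `max_batch` in this sketch trades per-request latency for aggregate throughput.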

In a training cluster: the workload manager (Slurm, Kubernetes, Ray, custom) that places jobs on physical hardware, handles preemption, and matches resource requests to availability. Hyperscaler training clusters use custom schedulers tuned for GPU-aware placement, gang-scheduling of distributed jobs, and fault-tolerance handoffs.
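Gang-scheduling means a distributed job starts only if all of its workers can be placed at once. A minimal sketch of that all-or-nothing check follows; the function `gang_place`, its parameters, and the node names are hypothetical, and the greedy "most free GPUs first" ordering is just one simple placement heuristic, not how Slurm, Kubernetes, or Ray decide.

```python
from typing import Optional


def gang_place(gpus_per_worker: int, num_workers: int,
               free_gpus: dict) -> Optional[dict]:
    """Return {node: workers_placed} if the whole gang fits, else None.

    free_gpus maps a node name to its count of idle GPUs. The input is
    not modified; a caller would commit the plan only when it is not None.
    """
    plan = {}
    remaining = num_workers
    # Visit the emptiest nodes first so workers land on as few nodes as possible.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            plan[node] = fit
            remaining -= fit
        if remaining == 0:
            return plan
    # Partial placement is useless for a gang: place nothing at all.
    return None


cluster = {"n1": 8, "n2": 4, "n3": 2}
print(gang_place(4, 3, cluster))  # three 4-GPU workers
print(gang_place(4, 4, cluster))  # four 4-GPU workers
```

In this example the three-worker job fits across `n1` and `n2`, while the four-worker job is refused outright even though 14 GPUs are idle, because the two GPUs on `n3` cannot host a 4-GPU worker.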

Scheduling design surfaces in user-visible ways. Cold-start latency on a “scale to zero” serverless inference endpoint comes from the scheduler waiting on hardware to provision. Long queue times on shared clusters come from the scheduler’s fairness policies. The scheduler is rarely the thing people demo, but it decides who runs and who waits.
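To see how a fairness policy turns into queue time, consider a deliberately simplified fair-share rule: always dispatch the pending job whose owner has consumed the least so far. The function `fair_share_order` and the users and job sizes below are made up for illustration; production schedulers generally weigh more factors than accumulated usage alone.

```python
from collections import defaultdict


def fair_share_order(jobs, usage=None):
    """Return job ids in dispatch order under a least-usage-first rule.

    jobs is a list of (user, job_id, gpu_hours) tuples in submission order.
    usage optionally seeds each user's prior consumption in GPU-hours.
    """
    usage = defaultdict(float, usage or {})
    pending = list(jobs)
    order = []
    while pending:
        # Lowest accumulated usage wins; ties go to the earliest submission.
        idx = min(range(len(pending)),
                  key=lambda i: (usage[pending[i][0]], i))
        user, job_id, cost = pending.pop(idx)
        usage[user] += cost  # charge the owner for the dispatched job
        order.append(job_id)
    return order


jobs = [("alice", "a1", 4), ("alice", "a2", 4),
        ("alice", "a3", 4), ("bob", "b1", 2)]
order = fair_share_order(jobs)
print(order)
```

Bob submitted last but runs second, because Alice has already been charged for `a1`. From Alice's side, the same policy shows up as `a2` and `a3` sitting in the queue longer than submission order alone would suggest.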
