Glossary

Cerebras

An AI compute company built around wafer-scale chips (the WSE-3 is a single die covering most of a 300mm wafer), offering some of the lowest inference latency on the market.

Category: Silicon. Also known as: Cerebras Systems, CS-3.

A different bet at the silicon layer. Where GPUs and most other AI accelerators are built from conventional reticle-sized dies, Cerebras’s Wafer-Scale Engine (WSE-3) is a single chip cut from an entire 300mm wafer (46,225 square millimeters, ~900,000 cores). Keeping everything on one die eliminates chip-to-chip communication for many workloads.
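To put the die size in context, the arithmetic below compares the quoted 46,225 square millimeters with the area of a 300mm circular wafer. It is a rough sketch: the wafer is treated as an ideal circle with no edge exclusion, and the function names are illustrative only.

```python
import math

WSE3_AREA_MM2 = 46_225      # die area quoted in this entry
WAFER_DIAMETER_MM = 300     # wafer diameter quoted in this entry


def wafer_area_mm2(diameter_mm: float) -> float:
    """Area of an idealized circular wafer (assumption: no edge exclusion or notch)."""
    return math.pi * (diameter_mm / 2) ** 2


def die_fraction(die_area_mm2: float, diameter_mm: float) -> float:
    """Share of the idealized wafer area occupied by the die."""
    return die_area_mm2 / wafer_area_mm2(diameter_mm)


if __name__ == "__main__":
    print(f"wafer area:  {wafer_area_mm2(WAFER_DIAMETER_MM):,.0f} mm^2")
    print(f"WSE-3 share: {die_fraction(WSE3_AREA_MM2, WAFER_DIAMETER_MM):.1%}")
```

Under these assumptions the die occupies roughly two thirds of the wafer's area, which is consistent with describing it as covering most of a 300mm wafer.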

The headline metric is inference latency. Cerebras’s hosted inference service produces tokens at hundreds to thousands per second on 70B-class models, far above GPU-based competitors. The trade-off is capital cost per system and a software stack that needs explicit support from frameworks (PyTorch with the Cerebras backend, the Cerebras compiler).
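A small sketch of how tokens-per-second throughput translates into user-facing response time may help here. It assumes the common simplification that response time is time-to-first-token plus output tokens divided by a steady generation rate. The rates in the example are hypothetical round numbers chosen for illustration, not measured figures for any vendor.

```python
def generation_seconds(
    output_tokens: int,
    tokens_per_second: float,
    ttft_seconds: float = 0.0,
) -> float:
    """Estimate response time as time-to-first-token plus steady-state decoding time.

    Assumption: the generation rate is constant for the whole response.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical rates: one in the tens, one in the thousands of tokens per second.
    for rate in (50, 1000):
        seconds = generation_seconds(output_tokens=500, tokens_per_second=rate)
        print(f"{rate:>5} tok/s -> {seconds:.2f} s for 500 tokens")
```

For a 500-token answer, moving from tens to thousands of tokens per second shifts the wait from several seconds to well under one, which is why throughput is treated as the headline number.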

Sources

Cerebras CS-3 announcement
