Glossary

Groq

An AI inference company with custom deterministic LPU chips and a hosted inference service that achieves extremely low time-per-token (1000+ tokens/sec on 70B models).

Category: Silicon. Also: Runtime. Also known as: Groq LPU.

Inference-only AI silicon. The Groq LPU (Language Processing Unit) is a deterministic-scheduling chip optimized for the autoregressive decode path: roughly 230 MB of on-die SRAM per chip with no external DRAM, and a compiler-scheduled execution model that eliminates runtime variability. The result on hosted inference is consistent thousand-tokens-per-second generation on 70B-class models.
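A back-of-the-envelope sketch of what the no-DRAM design implies: if the weights must live entirely in on-die SRAM, a 70B-parameter model has to be spread across many chips. The bytes-per-parameter values below are assumptions for illustration (the entry does not state a precision), `min_chips_for_weights` is a hypothetical helper, and the estimate ignores KV cache, activations, and any per-chip overhead.

```python
import math

# ~230 MB of on-die SRAM per chip (figure from the entry, taken as decimal MB).
SRAM_PER_CHIP_BYTES = 230e6


def min_chips_for_weights(num_params: float, bytes_per_param: float) -> int:
    """Lower bound on chips needed to hold the weights entirely in SRAM.

    Ignores KV cache, activations, and per-chip overhead, so a real
    deployment would need more chips than this.
    """
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / SRAM_PER_CHIP_BYTES)


if __name__ == "__main__":
    params_70b = 70e9
    # Precisions are assumed for illustration, not taken from the entry.
    for label, bytes_per_param in [("8-bit", 1.0), ("16-bit", 2.0)]:
        chips = min_chips_for_weights(params_70b, bytes_per_param)
        print(f"{label} weights: at least {chips} chips")
```

Under these assumptions the lower bound is 305 chips at 8-bit and 609 chips at 16-bit. A 70B-class model therefore occupies hundreds of chips, and the compiler-scheduled execution model has to coordinate across all of them.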

The architectural bet is that pure inference workloads do not need training-style flexibility, and a chip dedicated to inference can beat general-purpose GPUs by an order of magnitude on the latency-sensitive part of serving. The trade-off is training: Groq does not target training workloads at all.
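To make the order-of-magnitude claim concrete, the sketch below converts a decode throughput into time per output token and an approximate end-to-end response time. The time-to-first-token value and the 100 tokens/sec baseline are hypothetical numbers chosen only to illustrate a tenfold gap. They are not measurements of any particular hardware.

```python
def time_per_token_ms(tokens_per_sec: float) -> float:
    """Time per output token in milliseconds for a given decode throughput."""
    return 1000.0 / tokens_per_sec


def response_latency_s(ttft_s: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate end-to-end latency in seconds.

    Adds the time to first token to the decode time, counting every
    output token at the steady decode rate (a simplification).
    """
    return ttft_s + output_tokens / tokens_per_sec


if __name__ == "__main__":
    ttft_s = 0.2          # hypothetical time to first token
    output_tokens = 500   # hypothetical response length

    # 1000 tokens/sec is the figure quoted in the entry; 100 tokens/sec is
    # an assumed baseline one order of magnitude slower.
    for label, tps in [("1000 tok/s", 1000.0), ("100 tok/s (assumed baseline)", 100.0)]:
        per_token = time_per_token_ms(tps)
        total = response_latency_s(ttft_s, output_tokens, tps)
        print(f"{label}: {per_token:.1f} ms/token, {total:.1f} s total")
```

With these illustrative inputs, a 500-token response takes about 0.7 s at 1000 tokens/sec and about 5.2 s at 100 tokens/sec. For long outputs, decode speed dominates the user-facing latency.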

Sources

Groq
