Glossary
TPU
Google's custom AI accelerator family, used internally for training Gemini and externally via Google Cloud, designed around dense matrix multiplication with a systolic array architecture.
Google’s in-house alternative to GPUs. The architecture centers on a large systolic array of multiply-accumulate units whose dataflow is well matched to matrix-matrix multiplication, with on-chip memory larger than GPU caches and an interconnect (ICI) designed for tightly-coupled pods of thousands of chips.
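To make the systolic-array idea concrete, here is a minimal pure-Python sketch of a generic output-stationary array, where each cell is one multiply-accumulate unit that owns one output element. It is an illustration under stated assumptions, not a description of Google's hardware: the function name `systolic_matmul` and the register layout are hypothetical, and real TPU matrix units may use a different dataflow (for example, keeping weights stationary instead of outputs).

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on a hypothetical n x m grid of multiply-accumulate cells.

    Rows of A stream in from the left edge and columns of B from the top edge,
    each skewed by one cycle per row/column so that A[i][k] and B[k][j] arrive
    at cell (i, j) on the same cycle (t = i + j + k).
    """
    n, depth, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell (the output)
    a_reg = [[0] * m for _ in range(n)]  # operand of A currently held in each cell
    b_reg = [[0] * m for _ in range(n)]  # operand of B currently held in each cell

    # The last pair of operands reaches the bottom-right cell at
    # t = (n - 1) + (m - 1) + (depth - 1).
    for t in range(n + m + depth - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Left edge: inject A[i][k], delayed by i cycles (zero padding outside).
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < depth else 0
                else:
                    # Interior: take the operand the left neighbour held last cycle.
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    # Top edge: inject B[k][j], delayed by j cycles.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < depth else 0
                else:
                    # Interior: take the operand the upper neighbour held last cycle.
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

The point of the sketch is that each cell only ever talks to its immediate neighbours: operands are read from memory once at the edges and then reused as they pass through the grid, which is why the design suits workloads that are mostly matrix multiplication.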
TPUs are the substrate Google uses to train Gemini and serve consumer AI products. They are available externally through Google Cloud (v5e, v5p, Trillium, Ironwood generations). The software stack is JAX and XLA rather than CUDA, which is the main reason TPUs do not dominate beyond Google: most open-source training and inference code targets CUDA first.
Worth knowing because the design choices in TPUs influence the broader silicon discussion: systolic arrays as the answer to “we only need matmul,” tightly-coupled pods as the answer to “model parallelism is the bottleneck,” and software-defined network topology as the answer to “we want to schedule training and serving on the same fleet.”