The Open-Source AI Stack

Glossary

tokenizer

The component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding.

The piece of the model pipeline that turns text into integer IDs. Modern tokenizers use a subword vocabulary (typically 32K to 256K tokens) learned by Byte-Pair Encoding (BPE) or a similar algorithm. Common words become single tokens; rare words are split into multiple subword tokens; arbitrary Unicode falls back to byte-level encoding.
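The merge-and-fallback behavior described above can be sketched in a few lines of Python. This is a toy byte-level BPE written for illustration, not the implementation used by any production tokenizer; the function names (`train_bpe`, `encode`, `decode`), the whitespace pre-splitting, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter

Symbols = tuple[bytes, ...]
Pair = tuple[bytes, bytes]


def merge_pair(symbols: Symbols, pair: Pair) -> Symbols:
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list[Pair]:
    """Learn up to `num_merges` merge rules (toy version, whitespace pre-split)."""
    # Every word starts as a sequence of single-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in corpus.split()
    )
    merges: list[Pair] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by byte order (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged: Counter = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def build_vocab(merges: list[Pair]) -> dict[bytes, int]:
    # IDs 0..255 are the raw bytes, so any input can always be encoded.
    vocab = {bytes([b]): b for b in range(256)}
    for left, right in merges:
        vocab.setdefault(left + right, len(vocab))
    return vocab


def encode(text: str, merges: list[Pair]) -> list[int]:
    """Apply the merges in learned order, then map symbols to integer IDs."""
    vocab = build_vocab(merges)
    symbols = tuple(bytes([b]) for b in text.encode("utf-8"))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [vocab[s] for s in symbols]


def decode(ids: list[int], merges: list[Pair]) -> str:
    by_id = {i: s for s, i in build_vocab(merges).items()}
    return b"".join(by_id[i] for i in ids).decode("utf-8")
```

Trained on a small corpus where "low" is frequent, `encode("low", merges)` collapses to a single ID, while a character never seen in training, such as "é", is emitted as its raw UTF-8 bytes (IDs below 256). Real tokenizers add pre-tokenization rules, special tokens, and far larger merge tables.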

The tokenizer is part of the model: it must match between training and inference. The trained vocabulary determines how text maps to sequences of tokens, which determines context-length budgets (measured in tokens, not characters), per-language efficiency (non-English text often tokenizes to more tokens per character), and behavior on out-of-vocabulary input (rare scripts, code, technical notation).
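One way to see why token budgets diverge from character counts is to look at the byte-level fallback in isolation. The sketch below assumes the worst case, where no learned merge applies and every UTF-8 byte becomes its own token; the helper name `fallback_tokens_per_char` and the sample strings are illustrative, and a trained tokenizer will usually do considerably better than this upper bound.

```python
def fallback_tokens_per_char(text: str) -> float:
    """Worst-case tokens per character: one token per UTF-8 byte."""
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings, one per script.
samples = {
    "English": "tokenizer",
    "Russian": "токенизатор",
    "Japanese": "トークナイザー",
    "Emoji": "🙂",
}

for label, text in samples.items():
    print(f"{label}: {fallback_tokens_per_char(text):.1f} tokens per character")
```

ASCII text costs one byte per character, Cyrillic two, kana and CJK three, and most emoji four, so the same context window holds fewer characters of text that the vocabulary covers poorly.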

Production tokenizers in 2026: tiktoken (OpenAI), the Llama tokenizer family (Meta), the Qwen tokenizer, and Anthropic’s tokenizer for Claude. The Hugging Face tokenizers library implements the common variants (BPE, WordPiece, Unigram) and is the open-ecosystem reference. Tokenizer choice has measurable effects on training efficiency and downstream behavior, so it is one of the highest-impact design decisions in a model release.
