Glossary

tokenization

The process of mapping raw text into the integer-ID sequences a model consumes, governed by the model's specific tokenizer; the rate-limiting interface between text and tensor.

Data also: Training also: Runtime

The act of running text through the tokenizer and producing the sequence of token IDs the model will see. Every API call and every training-data ingestion pipeline runs this step. The output sequence length is what context-window limits measure: a 1M-token context window will hold roughly 750K English words, 500K words in non-English Latin scripts, and many fewer in some non-Latin scripts.

The interesting failure modes are language-skewed and notation-skewed. Languages whose writing systems were underrepresented in the tokenizerdataThe component that splits raw text into discrete units (tokens) the model can process, usually using a learned subword vocabulary like Byte-Pair Encoding. Open full entry ’s training data tokenize to many more tokens per character, which makes them more expensive per call and effectively shrinks the useful context. Some character classes (CJK, Indic scripts, emoji, specialized math notation) are particularly bad on tokenizers trained mostly on English.

Multilingual tokenizers (QwenweightsAlibaba's open-weight model family, leading the multilingual and Chinese-language open-weight space, released under Apache 2.0 with sizes from 0.6B to 235B parameters. Open full entry , Aya, BLOOM) address this with much larger vocabularies (200K+ tokens) and explicit coverage targets. The compute cost is small per call; the operational cost is in the model that has to learn over a larger embeddingretrieval-memoryA fixed-size vector representation of a piece of text learned so semantically similar texts land near each other in the vector space, the basis for vector search and most RAG. Open full entry table.

Sources

Byte Pair Encoding (Sennrich et al., 2016)

Back to glossary