The Open-Source AI Stack

02 Tokens and tokenizers

core

Models read integer token IDs, not words. Token counts, not word counts, set context limits and memory.

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

LLMs do not see raw text as words. They see tokens: small chunks of text represented internally as integer IDs. A token might be a whole word, a word fragment (such as “inter”, “national”, “ization”), a punctuation mark, a whitespace-prefixed string, a byte-level fallback, or a special control marker such as a system or assistant tag.
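To make the text-to-ID mapping concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption for illustration: the six-entry vocabulary, the ID numbering, the `BYTE_OFFSET` convention, and the greedy longest-match strategy (real tokenizers learn their vocabularies and usually apply learned merge rules instead).

```python
# Toy tokenizer sketch. The vocabulary, IDs, and matching strategy are
# hypothetical; they only illustrate pieces, byte fallback, and control markers.
BOS_ID = 0  # special control marker, never matched against the text itself
PIECES = {"inter": 1, "national": 2, "ization": 3, " the": 4, ".": 5}
BYTE_OFFSET = 100  # IDs 100..355 stand for the raw bytes 0..255
ID_TO_PIECE = {token_id: piece for piece, token_id in PIECES.items()}


def encode(text: str) -> list[int]:
    """Map text to token IDs using greedy longest match with byte fallback."""
    ids = [BOS_ID]
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in PIECES:
                match = text[i:j]
                break
        if match is not None:
            ids.append(PIECES[match])
            i += len(match)
        else:
            # No known piece starts here: fall back to the character's UTF-8 bytes.
            ids.extend(BYTE_OFFSET + b for b in text[i].encode("utf-8"))
            i += 1
    return ids


def decode(ids: list[int]) -> str:
    """Map token IDs back to text, dropping the control marker."""
    out = bytearray()
    for token_id in ids:
        if token_id == BOS_ID:
            continue
        if token_id >= BYTE_OFFSET:
            out.append(token_id - BYTE_OFFSET)
        else:
            out.extend(ID_TO_PIECE[token_id].encode("utf-8"))
    return out.decode("utf-8")


print(encode("internationalization"))  # [0, 1, 2, 3]
print(encode("é"))                     # [0, 295, 269]: two byte-fallback tokens
```

One 20-character word becomes three tokens plus a control marker, while a single accented character outside the vocabulary costs two tokens.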

The tokenizer maps text to token IDs and token IDs back to text. Common families include BPE-style tokenizers and SentencePiece-style tokenizers. Different model families use different tokenizers, and that matters: a 4,000-word document might be 5,000 tokens in one tokenizer and 7,500 in another. That difference alone makes tokens-per-second comparisons across families imperfect, because a token in one family does not carry the same amount of text as a token in another.
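The idea behind BPE-style tokenizers can be sketched in a few lines: start from raw bytes and repeatedly merge the most frequent adjacent pair. This is a simplified illustration, not any particular library's implementation; the function names `train_bpe`, `merge_pair`, and `encode_bpe` and the tiny corpus are hypothetical, and production tokenizers add pre-tokenization rules and much larger training corpora.

```python
from collections import Counter


def merge_pair(seq: list[bytes], a: bytes, b: bytes) -> list[bytes]:
    """Replace each adjacent occurrence of (a, b) with the merged piece a + b."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules from a corpus, starting at byte level."""
    seq = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so further merges gain nothing
        merges.append((a, b))
        seq = merge_pair(seq, a, b)
    return merges


def encode_bpe(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize text by replaying the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        seq = merge_pair(seq, a, b)
    return seq


corpus = "low low low lower lowest"
merges = train_bpe(corpus, num_merges=10)
# The same text costs a different number of tokens under different merge tables.
print(len(encode_bpe(corpus, merges[:2])), len(encode_bpe(corpus, merges)))
```

Two tokenizers trained with different merge budgets or on different corpora will split the same document into different token counts, which is why per-token throughput figures do not transfer cleanly between model families.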

Vocabulary size is a related tradeoff. A tokenizer with a larger vocabulary can compress some text into fewer tokens, but it also enlarges the embedding table and the output projection, both of which scale with vocabulary size. There is no free choice here; the tokenizer is a design decision, not an afterthought.
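The cost side of that tradeoff is simple arithmetic. The sketch below counts the parameters that depend on vocabulary size, assuming one embedding row and one output-projection row per vocabulary entry. The vocabulary sizes, the hidden dimension of 4096, and the `tied` option (some models share one matrix for both roles) are illustrative assumptions, not figures from any specific model.

```python
def vocab_dependent_params(vocab_size: int, hidden_dim: int, tied: bool = False) -> int:
    """Parameters in the embedding table plus the output projection.

    Assumes each has shape (vocab_size, hidden_dim). With tied weights the
    two share a single matrix, so it is counted only once.
    """
    embedding = vocab_size * hidden_dim
    output_projection = 0 if tied else vocab_size * hidden_dim
    return embedding + output_projection


# Hypothetical configurations for comparison.
small = vocab_dependent_params(vocab_size=32_000, hidden_dim=4096)
large = vocab_dependent_params(vocab_size=128_000, hidden_dim=4096)
print(f"{small:,}")  # 262,144,000
print(f"{large:,}")  # 1,048,576,000
```

Under these assumptions, quadrupling the vocabulary adds roughly 786 million parameters before any transformer layer is counted, which is a noticeable share of a small model.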

Tokens are the unit that determines how much text fits in the context window, how large the KV cache grows, how much latency you pay during prompt processing, and whether multilingual or code-heavy text is handled efficiently. They also decide whether the model sees special chat markers correctly, which is its own failure mode covered later in this track.
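A rough budgeting sketch shows why the token count, not the word count, decides what fits. The tokens-per-word ratios below are derived from the 4,000-word example earlier in this section (5,000 and 7,500 tokens); the 8,192-token window and the 1,024-token reserve for the reply are assumptions chosen for illustration. For a real decision, count tokens with the model's actual tokenizer rather than a ratio.

```python
import math


def estimated_tokens(word_count: int, tokens_per_word: float) -> int:
    """Rough token estimate from a word count and an assumed ratio."""
    return math.ceil(word_count * tokens_per_word)


def fits_in_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """The prompt and the generated reply share the same window."""
    return prompt_tokens + max_new_tokens <= context_window


WORDS = 4_000
CONTEXT_WINDOW = 8_192  # assumed window size
REPLY_RESERVE = 1_024   # assumed room kept free for the model's output

for name, ratio in [("tokenizer A", 1.25), ("tokenizer B", 1.875)]:
    prompt = estimated_tokens(WORDS, ratio)
    print(name, prompt, fits_in_context(prompt, REPLY_RESERVE, CONTEXT_WINDOW))
# tokenizer A 5000 True
# tokenizer B 7500 False
```

The same document fits comfortably under one tokenizer and overflows under the other, with no change to the text itself.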

A context window is the maximum number of tokens a model can attend to at once. In 2026, locally capable models range from 8K and 32K contexts up to 128K, 256K, and even 1M tokens on server-class systems. But supported length is not the same as cheap, fast, or equally accurate. A model that can technically handle 128K tokens may slow to a crawl at 64K and lose coherence past 100K. Test the lengths you actually plan to use. Once you treat tokens as the unit of work, long context stops looking magical and starts looking like a bill you can estimate.
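One line item on that bill is KV cache memory. A common back-of-the-envelope estimate stores one key vector and one value vector per token, per layer, per KV head. The sketch below applies it to a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); these numbers are assumptions, and the formula ignores optimizations such as cache quantization or sliding-window attention that reduce the real footprint.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes a 16-bit cache
) -> int:
    """Estimated KV cache size for one sequence under standard attention."""
    # The factor of 2 covers one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


GIB = 1024**3
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical model

for tokens in (8_192, 131_072):
    size = kv_cache_bytes(tokens, **config)
    print(f"{tokens:>7} tokens -> {size / GIB:.0f} GiB")
#    8192 tokens -> 1 GiB
#  131072 tokens -> 16 GiB
```

With this configuration each token costs 128 KiB of cache, so the cost grows linearly: sixteen times the context means sixteen times the memory, on top of the model weights.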