The Open-Source AI Stack

Glossary

BPE

A subword tokenization algorithm that iteratively merges the most frequent adjacent pair of bytes (or previously merged tokens) in a corpus, producing a vocabulary that balances common-word coverage with arbitrary-text fallback.

Category: Data (also: Training). Also known as: byte-pair encoding, byte pair encoding.

An algorithm for learning a subword vocabulary. Starting from bytes (or characters), the procedure repeatedly counts adjacent token pairs in the training corpus, merges the most frequent pair into a new vocabulary entry, and continues until the vocabulary reaches the target size. The result is a vocabulary where common words become single tokens and rare words decompose into subword pieces.
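The training loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular tokenizer: the function name `train_bpe`, the helper `merge_pair`, and the whitespace pre-splitting are all choices made for this example (production tokenizers typically use a regex pre-tokenizer and far more efficient pair counting).

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged token."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, vocab_size):
    """Learn byte-level BPE merges until the vocabulary reaches `vocab_size`.

    Assumption for this sketch: text is pre-split on whitespace, and merges
    never cross word boundaries.
    """
    # The base vocabulary is the 256 possible byte values.
    vocab = [bytes([i]) for i in range(256)]
    # Each distinct word is a tuple of single-byte tokens, with its frequency.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for text in corpus
        for word in text.split()
    )
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Rewrite every word with the new merged token.
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return merges, vocab


merges, vocab = train_bpe(["hug hug hug pug"], vocab_size=300)
print(merges)  # [(b'u', b'g'), (b'h', b'ug'), (b'p', b'ug')]
```

On this toy corpus the loop stops after three merges, well short of the requested size, because every word has already collapsed into a single token and no adjacent pairs remain.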

GPT-2 popularized byte-level BPE as the default tokenizer for language modeling. Llama, Mistral, Qwen, and most modern open-weight families use BPE variants. The main alternative, SentencePiece’s unigram model (Kudo, 2018), produces similar results via a different optimization criterion and is used in some Google-derived tokenizers.

BPE’s strength is graceful fallback: any unseen text decomposes to known subwords or bytes. Its weakness is that the merge decisions are fixed once trained; a tokenizer trained mostly on English allocates most of its vocabulary to English at the expense of other languages. The 2024 to 2026 multilingual tokenizers (Qwen, Aya) use much larger BPE vocabularies trained on balanced multilingual corpora to mitigate this.
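Both properties show up when a fixed merge table is applied to new text. The sketch below is illustrative only: `encode` is a simplified encoder written for this example, and `MERGES` is a hypothetical merge table standing in for one learned from a small English corpus, not taken from any real tokenizer.

```python
def encode(text, merges):
    """Apply learned merges, in training order, to the UTF-8 bytes of `text`."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:  # the merge order is fixed at training time
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


# Hypothetical merge table, as if learned from English text.
MERGES = [(b"t", b"h"), (b"th", b"e"), (b"i", b"n"), (b"in", b"g")]

english = encode("the thing", MERGES)
japanese = encode("日本語", MERGES)

print(english)        # [b'the', b' ', b'th', b'ing']
print(len(japanese))  # 9
```

The nine-character English string compresses to four tokens, while the three-character Japanese string matches no merge and falls back to nine single-byte tokens. Nothing is lost, since joining the bytes recovers the original text, but the sequence is longer, which is the cost an English-heavy vocabulary imposes on other languages.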
