
Subword Tokenization Algorithms
Understanding the algorithms behind tokenization in Large Language Models.
Byte Pair Encoding (BPE), Unigram, and WordPiece are all subword tokenization algorithms widely used in modern NLP models. Here’s a breakdown of how each works, along with similarities and differences:
1. Byte Pair Encoding (BPE)
Used in GPT (OpenAI), RoBERTa, BART, and many others.
- How it works (a runnable sketch follows the example below):
  - Begins with a base vocabulary of all unique characters.
  - Scans a large text corpus to find the most frequent pair of adjacent symbols (characters or already-merged subwords).
  - Merges that pair into a new symbol (subword).
  - Repeats merging until a target vocabulary size or number of merges is reached.
  - When encoding new text, greedily produces the longest possible subwords from its learned vocabulary.
- Example:
  - “Reading” → initially [“R”, “e”, “a”, “d”, “i”, “n”, “g”]
  - After merges: [“Read”, “ing”]
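To make the merge loop concrete, here is a minimal Python sketch of BPE training over a toy word-frequency table. The corpus statistics and merge count are illustrative placeholders, not taken from any real tokenizer.

```python
# A toy BPE trainer: count adjacent symbol pairs over a word-frequency table and
# greedily merge the most frequent pair until the requested number of merges is done.
# Illustrative sketch only, not a production tokenizer.
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Start from single characters, the base vocabulary.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits

# Hypothetical corpus statistics, chosen only for illustration.
word_freqs = {"reading": 10, "read": 8, "reader": 6, "ending": 5}
merges, splits = train_bpe(word_freqs, num_merges=5)
print(merges)             # ordered merge rules, e.g. ('r', 'e'), ('re', 'a'), ...
print(splits["reading"])  # -> ('read', 'ing') with this toy corpus
```

Real implementations also keep the merge rules in learned order, so new text can be encoded by replaying the same merges.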
2. Unigram Language Model
Used in models like T5, XLNet, ALBERT, mBART.
- How it works (a segmentation sketch follows the example below):
  - Starts with a large set of candidate subwords (including all possible splits).
  - Assigns a probability to each subword.
  - Uses statistical modeling (typically the EM algorithm) to iteratively drop less probable subwords, shrinking the vocabulary.
  - When encoding new text, finds the segmentation (split into subwords) that yields the highest probability under the trained model.
- Example:
  - Could segment “enjoyed” as [“en”, “joyed”] or [“enjoy”, “ed”], whichever has the higher probability.
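The encoding step can be illustrated with a small dynamic-programming (Viterbi) search over a toy table of subword probabilities. This is only a sketch of segmentation; the probabilities below are hand-picked rather than learned with EM.

```python
# Unigram-style encoding sketch: given subword log-probabilities, find the
# segmentation of a word with the highest total probability via dynamic programming.
import math

def best_segmentation(word, logprob):
    n = len(word)
    best_score = [float("-inf")] * (n + 1)  # best log-prob of word[:i]
    best_split = [0] * (n + 1)              # start index of the last subword in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob and best_score[start] + logprob[piece] > best_score[end]:
                best_score[end] = best_score[start] + logprob[piece]
                best_split[end] = start
    # Backtrack to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_split[end]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces)), best_score[n]

# Hypothetical unigram probabilities for a handful of subwords.
logprob = {piece: math.log(p) for piece, p in
           {"en": 0.05, "joy": 0.02, "joyed": 0.01,
            "enjoy": 0.04, "ed": 0.06, "d": 0.03}.items()}
print(best_segmentation("enjoyed", logprob))  # picks ['enjoy', 'ed'], the most probable split
```

In practice, libraries such as SentencePiece learn these probabilities with EM and can also sample alternative splits during training (subword regularization).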
3. WordPiece
Used in Google’s BERT and its variants.
- How it works (a sketch of the encoding step appears after this list):
  - Very similar to BPE; starts with individual characters as tokens.
  - Merges pairs based on maximizing the likelihood of the training data, choosing the merge that most increases the probability of the corpus.
  - Originally developed at Google for speech recognition, later adopted for BERT.
- Segmentation rules:
  - Often uses a special marker (e.g., ##) for subwords that appear in the middle or end of a word (e.g., “unbelievable” → [“un”, “##believable”]).
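The greedy longest-match encoding with “##” continuation markers can be sketched as follows. The vocabulary here is a hand-picked toy set, and the likelihood-based training step is not shown.

```python
# WordPiece-style greedy longest-match encoding with "##" continuation markers,
# in the spirit of BERT's tokenizer. Toy vocabulary; training is omitted.
def wordpiece_encode(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest substring first, shrinking until something is in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid segmentation for this word
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "believ", "able", "##believ", "##able", "##believable"}
print(wordpiece_encode("unbelievable", vocab))  # ['un', '##believable']
```

The “##” prefix lets the model distinguish word-initial pieces from word-internal ones when reassembling text.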
Similarities
- All are subword tokenization algorithms—breaking words into chunks to efficiently handle rare or unknown terms and avoid out-of-vocabulary errors.
- All construct a finite vocabulary of subwords during training, usable to encode any input text.
- All support language diversity, rare word handling, and compression.
Differences
Feature | BPE | Unigram | WordPiece |
---|---|---|---|
Merge rule | Most frequent adjacent pair | Probabilistic model; prunes low-probability subwords | Pair that most increases corpus likelihood |
Algorithm type | Deterministic, greedy merges | Probabilistic, EM optimization | Deterministic, likelihood-based merges |
Encoding | Greedily uses longest subwords | Finds best split (max probability) | Greedily uses longest subwords |
Vocabulary generation | All merges added | Starts large, prunes unnecessary subwords | Merges chosen by likelihood gain |
Output variety | Same output for same input | Multiple valid splits possible (can sample) | Same output for same input |
Marker usage | No special symbol | No special symbol | Special marker (“##”) |
- BPE and WordPiece are similar in being greedy merge-based approaches, but WordPiece merges based on maximizing training data likelihood rather than mere frequency.
- Unigram differs by keeping a probabilistic model of all possible subword splits and pruning; for encoding, it chooses the most probable splitting rather than just the longest matches (all three behaviors are illustrated in the sketch below).
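For a hands-on comparison, the sketch below trains all three tokenizer types on the same tiny corpus and segments the same word. It assumes the Hugging Face `tokenizers` package is installed; the corpus, vocabulary size, and test word are arbitrary placeholders.

```python
# Train BPE, WordPiece, and Unigram tokenizers on the same toy corpus and compare
# how each segments the same word. Assumes: pip install tokenizers
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "reading readers read the readable book",
    "she enjoyed reading and rereading it",
] * 50  # repeat so the trainers see stable frequency statistics

setups = {
    "BPE": (models.BPE(unk_token="[UNK]"),
            trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])),
    "WordPiece": (models.WordPiece(unk_token="[UNK]"),
                  trainers.WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])),
    "Unigram": (models.Unigram(),
                trainers.UnigramTrainer(vocab_size=100, unk_token="[UNK]",
                                        special_tokens=["[UNK]"])),
}

for name, (model, trainer) in setups.items():
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation first
    tok.train_from_iterator(corpus, trainer)
    print(name, tok.encode("rereading").tokens)
```

Because all three models are trained on the same data with the same vocabulary budget, any differences in the printed splits come from the algorithms’ merge and pruning criteria rather than the data.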
Benchmark
Recent benchmarks and analyses comparing BPE, WordPiece, and Unigram tokenization algorithms reveal several findings:
- Compression (Efficiency) – “Fertility Scores” (a sketch for computing fertility follows this list):
  - Unigram usually achieves the best compression, needing the fewest tokens per input sequence. Benchmarks show Unigram averaging about 2 tokens per instruction, compared to BPE’s 2.5–3 and WordPiece’s 4.5 (lower is better for compression).
  - BPE offers a middle ground: better compression than WordPiece, slightly less than Unigram.
  - WordPiece typically produces more tokens per sequence, which may be less efficient but can provide finer-grained control for certain tasks.
- Downstream Task Accuracy:
  - For tasks like function signature or parameter prediction in LLMs:
    - BPE frequently achieves the highest or near-highest prediction accuracy, especially as vocabulary size increases (e.g., BPE-35K: 85.76% accuracy).
    - WordPiece generally follows BPE, with solid but slightly lower performance.
    - Unigram sometimes trails BPE and WordPiece in accuracy but can perform better for certain languages or preprocessing setups, and excels at maximizing sequence compactness.
  - Results depend on model architecture, language, and the specific downstream task.
- Vocabulary Size Effects:
  - In the benchmarks above, accuracy generally improves as vocabulary size grows (e.g., the BPE-35K result cited above).
- Robustness & Flexibility:
  - Unigram offers more flexible tokenization (supports multiple valid splits), which may enhance model robustness during training.
  - BPE is faster and simpler in deployment; WordPiece sits in between but is less compressive than Unigram.
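Fertility, as used above, is simply the average number of tokens a tokenizer produces per unit of input (per word or per instruction). A minimal sketch, assuming a tokenizer object that exposes an `encode(text)` returning tokens, as Hugging Face `tokenizers` objects do:

```python
# Compute a fertility score: average tokens per whitespace-separated word.
# `tokenizer` is assumed to provide encode(text).tokens (e.g., a Hugging Face
# `tokenizers` Tokenizer); adapt the call for other libraries.
def fertility(tokenizer, texts):
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenizer.encode(text).tokens)
        total_words += len(text.split())
    return total_tokens / max(total_words, 1)  # lower = better compression

# Usage (hypothetical), comparing tokenizers trained as in the earlier sketch:
# print(fertility(bpe_tok, eval_texts), fertility(unigram_tok, eval_texts))
```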
Summary Table
Metric | BPE | WordPiece | Unigram |
---|---|---|---|
Tokens per input (fertility) | Medium (2.5–3) | High (4.5, least efficient) | Low (~2, most efficient) |
Accuracy | Highest or near-best | Slightly below BPE | Mixed: sometimes lowest, varies by language and setup |
Flexibility | Good | Moderate | Highest (supports sampling, robust) |
Best for | General NLP, LLMs | Tasks needing finer token granularity | Compression, multilingual, robust LLMs |
Takeaway:
- BPE provides a strong trade-off between accuracy and efficiency, making it a default choice in many LLMs.
- Unigram is best for maximum compression and robust tokenization, particularly in multilingual or complex input scenarios.
- WordPiece offers a balanced solution, with slightly more granularity, useful for certain research or legacy models.
Summary
- All three are used for modern LLMs and transformers (e.g., BPE in GPT, WordPiece in BERT, Unigram in T5/XLNet).
- Unigram is most flexible but computationally heavier.
- BPE is fastest and simplest.
- WordPiece is similar to BPE but selects merges by corpus likelihood rather than raw frequency, with some subtle implementation differences.