
Subword Tokenization Algorithms
Understanding the algorithms behind tokenization in Large Language Models.
Byte Pair Encoding (BPE), Unigram, and WordPiece are all subword tokenization algorithms widely used in modern NLP models. Here’s a breakdown of how each works, along with similarities and differences:
1. Byte Pair Encoding (BPE)
Used in GPT (OpenAI), RoBERTa, BART, and many others.
- How it works (a runnable sketch follows the example below):
  - Begins with a base vocabulary of all unique characters.
  - Scans a large text corpus to find the most frequent pair of adjacent symbols (characters or already-merged subwords).
  - Merges that pair into a new symbol (subword).
  - Repeats merging until a target vocabulary size or number of merges is reached.
  - When encoding new text, greedily produces the longest possible subwords from its learned vocabulary.
- Example:
  - “Reading” → initially [“R”, “e”, “a”, “d”, “i”, “n”, “g”]
  - After merges: [“Read”, “ing”]
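To make the merge loop concrete, here is a minimal Python sketch of BPE training over a toy word-frequency table. The corpus statistics and merge count are illustrative placeholders, not taken from any real tokenizer.

```python
# A toy BPE trainer: count adjacent symbol pairs over a word-frequency table and
# greedily merge the most frequent pair until the requested number of merges is done.
# Illustrative sketch only, not a production tokenizer.
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Start from single characters, the base vocabulary.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits

# Hypothetical corpus statistics, chosen only for illustration.
word_freqs = {"reading": 10, "read": 8, "reader": 6, "ending": 5}
merges, splits = train_bpe(word_freqs, num_merges=5)
print(merges)             # ordered merge rules, e.g. ('r', 'e'), ('re', 'a'), ...
print(splits["reading"])  # -> ('read', 'ing') with this toy corpus
```

Real implementations also keep the merge rules in learned order, so new text can be encoded by replaying the same merges.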
2. Unigram Language Model
Used in models like T5, XLNet, ALBERT, mBART.
- How it works (a segmentation sketch follows the example below):
  - Starts with a large set of candidate subwords (including all possible splits).
  - Assigns a probability to each subword.
  - Uses statistical modeling (typically the EM algorithm) to iteratively drop less probable subwords, shrinking the vocabulary.
  - When encoding new text, finds the segmentation (split into subwords) that yields the highest probability under the trained model.
- Example:
  - Could segment “enjoyed” as [“en”, “joyed”] or [“enjoy”, “ed”], whichever has the higher probability.
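The encoding step can be illustrated with a small dynamic-programming (Viterbi) search over a toy table of subword probabilities. This is only a sketch of segmentation; the probabilities below are hand-picked rather than learned with EM.

```python
# Unigram-style encoding sketch: given subword log-probabilities, find the
# segmentation of a word with the highest total probability via dynamic programming.
import math

def best_segmentation(word, logprob):
    n = len(word)
    best_score = [float("-inf")] * (n + 1)  # best log-prob of word[:i]
    best_split = [0] * (n + 1)              # start index of the last subword in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob and best_score[start] + logprob[piece] > best_score[end]:
                best_score[end] = best_score[start] + logprob[piece]
                best_split[end] = start
    # Backtrack to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_split[end]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces)), best_score[n]

# Hypothetical unigram probabilities for a handful of subwords.
logprob = {piece: math.log(p) for piece, p in
           {"en": 0.05, "joy": 0.02, "joyed": 0.01,
            "enjoy": 0.04, "ed": 0.06, "d": 0.03}.items()}
print(best_segmentation("enjoyed", logprob))  # picks ['enjoy', 'ed'], the most probable split
```

In practice, libraries such as SentencePiece learn these probabilities with EM and can also sample alternative splits during training (subword regularization).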
3. WordPiece
Used in Google’s BERT and its variants.
- How it works (a sketch of the encoding step appears after this list):
  - Very similar to BPE; starts with individual characters as tokens.
  - Merges pairs based on maximizing the likelihood of the training data, choosing the merge that most increases the probability of the corpus.
  - Originally developed at Google for speech recognition, later adopted for BERT.
- Segmentation rules:
  - Often uses a special marker (e.g., ##) for subwords that appear in the middle or end of a word (e.g., “unbelievable” → [“un”, “##believable”]).
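The greedy longest-match encoding with “##” continuation markers can be sketched as follows. The vocabulary here is a hand-picked toy set, and the likelihood-based training step is not shown.

```python
# WordPiece-style greedy longest-match encoding with "##" continuation markers,
# in the spirit of BERT's tokenizer. Toy vocabulary; training is omitted.
def wordpiece_encode(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest substring first, shrinking until something is in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid segmentation for this word
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "believ", "able", "##believ", "##able", "##believable"}
print(wordpiece_encode("unbelievable", vocab))  # ['un', '##believable']
```

The “##” prefix lets the model distinguish word-initial pieces from word-internal ones when reassembling text.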
Similarities
- All are subword tokenization algorithms—breaking words into chunks to efficiently handle rare or unknown terms and avoid out-of-vocabulary errors.
- All construct a finite vocabulary of subwords during training, usable to encode any input text.
- All support language diversity, rare word handling, and compression.
Differences
Feature | BPE | Unigram | WordPiece |
---|---|---|---|
Merge rule | Most frequent adjacent pair | Probabilistic model; prunes low-probability subwords | Pair that most increases corpus likelihood |
Algorithm type | Deterministic, greedy merges | Probabilistic, EM optimization | Deterministic, likelihood-based merges |
Encoding | Greedily uses longest subwords | Finds best split (max probability) | Greedily uses longest subwords |
Vocabulary generation | All merges added | Starts large, prunes unnecessary subwords | Merges chosen by likelihood gain |
Output variety | Same output for same input | Multiple valid splits possible (can sample) | Same output for same input |
Marker usage | No special symbol | No special symbol | Special marker (“##”) |
- BPE and WordPiece are similar in being greedy merge-based approaches, but WordPiece merges based on maximizing training data likelihood rather than mere frequency.
- Unigram differs by keeping a probabilistic model of all possible subword splits and pruning; for encoding, it chooses the most probable splitting rather than just the longest matches (all three behaviors are illustrated in the sketch below).
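For a hands-on comparison, the sketch below trains all three tokenizer types on the same tiny corpus and segments the same word. It assumes the Hugging Face `tokenizers` package is installed; the corpus, vocabulary size, and test word are arbitrary placeholders.

```python
# Train BPE, WordPiece, and Unigram tokenizers on the same toy corpus and compare
# how each segments the same word. Assumes: pip install tokenizers
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "reading readers read the readable book",
    "she enjoyed reading and rereading it",
] * 50  # repeat so the trainers see stable frequency statistics

setups = {
    "BPE": (models.BPE(unk_token="[UNK]"),
            trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])),
    "WordPiece": (models.WordPiece(unk_token="[UNK]"),
                  trainers.WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])),
    "Unigram": (models.Unigram(),
                trainers.UnigramTrainer(vocab_size=100, unk_token="[UNK]",
                                        special_tokens=["[UNK]"])),
}

for name, (model, trainer) in setups.items():
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation first
    tok.train_from_iterator(corpus, trainer)
    print(name, tok.encode("rereading").tokens)
```

Because all three models are trained on the same data with the same vocabulary budget, any differences in the printed splits come from the algorithms’ merge and pruning criteria rather than the data.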
Benchmark
Recent benchmarks and analyses comparing BPE, WordPiece, and Unigram tokenization algorithms reveal several findings:
- Compression (Efficiency) – “Fertility Scores” (a sketch for computing fertility follows this list):
  - Unigram usually achieves the best compression, needing the fewest tokens per input sequence. Benchmarks show Unigram averaging about 2 tokens per instruction, compared to BPE’s 2.5–3 and WordPiece’s 4.5 (lower is better for compression).
  - BPE offers a middle ground: better compression than WordPiece, slightly less than Unigram.
  - WordPiece typically produces more tokens per sequence, which may be less efficient but can provide finer-grained control for certain tasks.
- Downstream Task Accuracy:
  - For tasks like function signature or parameter prediction in LLMs:
    - BPE frequently achieves the highest or near-highest prediction accuracy, especially as vocabulary size increases (e.g., BPE-35K: 85.76% accuracy).
    - WordPiece generally follows BPE, with solid but slightly lower performance.
    - Unigram sometimes trails BPE and WordPiece in accuracy but can perform better for certain languages or preprocessing setups, and excels at maximizing sequence compactness.
  - Results depend on model architecture, language, and the specific downstream task.
- Vocabulary Size Effects:
  - In the benchmarks above, accuracy generally improves as vocabulary size grows (e.g., the BPE-35K result cited above).
- Robustness & Flexibility:
  - Unigram offers more flexible tokenization (supports multiple valid splits), which may enhance model robustness during training.
  - BPE is faster and simpler in deployment; WordPiece sits in between but is less compressive than Unigram.
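Fertility, as used above, is simply the average number of tokens a tokenizer produces per unit of input (per word or per instruction). A minimal sketch, assuming a tokenizer object that exposes an `encode(text)` returning tokens, as Hugging Face `tokenizers` objects do:

```python
# Compute a fertility score: average tokens per whitespace-separated word.
# `tokenizer` is assumed to provide encode(text).tokens (e.g., a Hugging Face
# `tokenizers` Tokenizer); adapt the call for other libraries.
def fertility(tokenizer, texts):
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenizer.encode(text).tokens)
        total_words += len(text.split())
    return total_tokens / max(total_words, 1)  # lower = better compression

# Usage (hypothetical), comparing tokenizers trained as in the earlier sketch:
# print(fertility(bpe_tok, eval_texts), fertility(unigram_tok, eval_texts))
```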
Summary Table
Metric | BPE | WordPiece | Unigram |
---|---|---|---|
Tokens per input (fertility) | Medium (2.5–3) | High (4.5, least efficient) | Low (~2, most efficient) |
Accuracy | Highest or near-best | Slightly below BPE | Mixed: sometimes lowest, varies by language and setup |
Flexibility | Good | Moderate | Highest (supports sampling, robust) |
Best for | General NLP, LLMs | Tasks needing finer token granularity | Compression, multilingual, robust LLMs |
Takeaway:
- BPE provides a strong trade-off between accuracy and efficiency, making it a default choice in many LLMs.
- Unigram is best for maximum compression and robust tokenization, particularly in multilingual or complex input scenarios.
- WordPiece offers a balanced solution, with slightly more granularity, useful for certain research or legacy models.
Summary
- All three are used for modern LLMs and transformers (e.g., BPE in GPT, WordPiece in BERT, Unigram in T5/XLNet).
- Unigram is most flexible but computationally heavier.
- BPE is fastest and simplest.
- WordPiece is similar to BPE but selects merges by corpus likelihood rather than raw frequency, with some subtle implementation differences.