luminary.blog
by Oz Akan

Subword Tokenization Algorithms

Understanding the algorithms behind tokenization in Large Language Models.

4 min read


Byte Pair Encoding (BPE), Unigram, and WordPiece are all subword tokenization algorithms widely used in modern NLP models. Here’s a breakdown of how each works, along with similarities and differences:

1. Byte Pair Encoding (BPE)

Used in GPT (OpenAI), RoBERTa, BART, and many others.

  • How it works:

    • Begins with a base vocabulary of all unique characters.
    • Scans a large text corpus to find the most frequent pair of adjacent symbols (could be characters or already-merged subwords).
    • Merges that pair into a new symbol (subword).
    • Repeats merging until reaching a target vocabulary size or number of merges.
    • When encoding new text, applies its learned merge rules in the order they were learned, so frequent words collapse into a few long subwords (a toy sketch of both the training and encoding steps follows the example below).
  • Example:

    • “Reading” → Initially [“R”, “e”, “a”, “d”, “i”, “n”, “g”]
    • After merges: [“Read”, “ing”]
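
A minimal sketch of both steps in Python, assuming a made-up word-frequency corpus and a hand-picked merge budget (real implementations such as GPT-2’s byte-level BPE add byte handling, special tokens, and tens of thousands of merges):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

# Toy corpus: word -> frequency, each word pre-split into characters.
vocab = {
    tuple("reading"): 10,
    tuple("reads"): 6,
    tuple("read"): 12,
    tuple("coding"): 8,
}

merges = []
for _ in range(5):  # stand-in for a vocabulary-size budget
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = {apply_merge(word, best): freq for word, freq in vocab.items()}

def encode(word, merges):
    """Encode a new word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

print(merges)                     # the learned merge rules, in order
print(encode("reading", merges))  # ['read', 'ing'] on this toy corpus
```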

2. Unigram Language Model

Used in models like T5, XLNet, ALBERT, mBART.

  • How it works:

    • Starts with a large seed vocabulary of candidate subwords (for example, all characters plus frequent substrings of the corpus).
    • Assigns a probability to each subword.
    • Uses statistical modeling (typically the EM algorithm) to iteratively prune the subwords whose removal costs the least corpus likelihood, shrinking the vocabulary to the target size.
    • When encoding new text, finds the segmentation (split into subwords) that yields the highest probability under the trained model (see the sketch after the example below).
  • Example:

    • Could segment “enjoyed” as [“en”, “joyed”] or [“enjoy”, “ed”], whichever has the highest probability.
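
A rough sketch of that encoding step, assuming a hypothetical table of subword probabilities (in practice the Unigram trainer, e.g. in SentencePiece, learns these probabilities with EM over a large corpus):

```python
import math

# Hypothetical subword probabilities, made up for illustration.
subword_probs = {
    "enjoy": 0.03, "en": 0.05, "joy": 0.02, "joyed": 0.005, "ed": 0.08,
    "e": 0.10, "n": 0.09, "j": 0.01, "o": 0.07, "y": 0.06, "d": 0.07,
}

def best_segmentation(word):
    """Viterbi-style search for the highest-probability segmentation of `word`."""
    n = len(word)
    best_logp = [0.0] + [-math.inf] * n   # best log-probability of word[:i]
    back = [0] * (n + 1)                  # start index of the last subword in the best split
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in subword_probs:
                score = best_logp[j] + math.log(subword_probs[piece])
                if score > best_logp[i]:
                    best_logp[i], back[i] = score, j
    # Walk the backpointers to recover the winning split.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(best_segmentation("enjoyed"))  # ['enjoy', 'ed'] under these toy probabilities
```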

3. WordPiece

Used in Google’s BERT and its variants.

  • How it works:

    • Very similar to BPE; starts with individual characters as tokens.
    • Merges the pair that most increases the likelihood of the training corpus; in practice, the pair with the highest score count(ab) / (count(a) × count(b)) is merged, rather than simply the most frequent pair.
    • Originally developed at Google for speech recognition, later adopted in BERT and related models.
  • Segmentation rules:

    • Uses a special marker (##) for subword pieces that continue a word, i.e., appear in the middle or at the end (e.g., “unbelievable” → [“un”, “##believable”]); see the sketch below.
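
A small sketch of that encoding rule, in the style of BERT’s greedy longest-match-first WordPiece encoder; the vocabulary fragment here is invented for illustration:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation with '##' continuation markers."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:                  # try the longest remaining piece first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece        # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                   # nothing matched: emit the unknown token
            return [unk]
        tokens.append(match)
        start = end
    return tokens

# Hypothetical vocabulary fragment.
vocab = {"un", "believ", "able", "##believ", "##able", "##believable"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believable']
```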

Similarities

  • All are subword tokenization algorithms—breaking words into chunks to efficiently handle rare or unknown terms and avoid out-of-vocabulary errors.
  • All construct a finite vocabulary of subwords during training, usable to encode any input text.
  • All support language diversity, rare word handling, and compression.

Differences

| Feature | BPE | Unigram | WordPiece |
| --- | --- | --- | --- |
| Merge rule | Most frequent pairs | Probabilistic model; prunes low-probability subwords | Pairs that most increase likelihood |
| Algorithm type | Deterministic, greedy merges | Probabilistic, EM optimization | Deterministic, but statistics-based |
| Encoding | Applies learned merges in order | Finds best split (max probability) | Greedy longest-match-first |
| Vocabulary generation | All merges added | Prunes unnecessary subwords | Merges chosen for likelihood |
| Output variety | Always the same for a given input | May have multiple valid splits | Same input ≈ same output |
| Marker usage | No special symbol | No special symbol | Special marker ("##") |
  • BPE and WordPiece are similar in being greedy merge-based approaches, but WordPiece chooses merges by how much they increase training-data likelihood rather than by raw frequency (see the small comparison sketch after this list).
  • Unigram differs by keeping a probabilistic model of all possible subword splits and pruning; for encoding, it chooses the most probable splitting, not just the longest matches.
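
To make the merge-criterion difference concrete, here is a tiny sketch with made-up counts: plain BPE picks the most frequent adjacent pair, while a WordPiece-style score divides the pair count by the counts of its parts, which is proportional to the likelihood gain of the merge:

```python
from collections import Counter

def pick_merge(pair_counts, unit_counts, criterion):
    """Choose the next merge under two different criteria."""
    if criterion == "bpe":
        # BPE: raw frequency of the adjacent pair.
        return max(pair_counts, key=pair_counts.get)
    # WordPiece-style: likelihood gain ~ count(ab) / (count(a) * count(b)).
    return max(pair_counts,
               key=lambda ab: pair_counts[ab] / (unit_counts[ab[0]] * unit_counts[ab[1]]))

# Made-up counts for illustration.
pair_counts = Counter({("e", "d"): 90, ("q", "u"): 30})
unit_counts = Counter({"e": 500, "d": 400, "q": 30, "u": 60})

print(pick_merge(pair_counts, unit_counts, "bpe"))        # ('e', 'd'): most frequent pair
print(pick_merge(pair_counts, unit_counts, "wordpiece"))  # ('q', 'u'): bigger likelihood gain
```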

Benchmark

Recent benchmarks and analyses comparing BPE, WordPiece, and Unigram tokenization algorithms reveal several findings:

  • Compression (Efficiency) – “Fertility Scores”:

    • Unigram usually achieves the best compression, needing the fewest tokens per input sequence. Benchmarks show Unigram averaging about 2 tokens per instruction, compared to BPE’s 2.5–3 and WordPiece’s 4.5 (lower is better for compression); a minimal fertility helper appears after this list.
    • BPE offers a middle ground: better compression than WordPiece, a bit less than Unigram.
    • WordPiece typically produces more tokens per sequence, which may be less efficient but can provide finer control for certain tasks [1].
  • Downstream Task Accuracy:

    • For tasks like function signature or parameter prediction in LLMs:
      • BPE frequently achieves the highest or very close to the highest prediction accuracy, especially as vocabulary size increases (e.g., BPE-35K: 85.76% accuracy).
      • WordPiece generally follows BPE, with solid but slightly lower performance.
      • Unigram sometimes trails BPE and WordPiece in accuracy but can perform better for certain languages or preprocessing setups and excels in maximizing sequence compactness.
    • Results depend on model architecture, language, and specific downstream task [1].
  • Vocabulary Size Effects:

    • Increasing vocabulary size improves all tokenizers up to a point, but the effect tapers off (diminishing returns).
    • All methods perform comparably at larger vocabulary sizes (25K+), but Unigram and BPE may yield more stable and efficient segmentations as vocab size grows [2][1].
  • Robustness & Flexibility:

    • Unigram offers more flexible tokenization (supports multiple valid splits), which may enhance model robustness during training.
    • BPE is faster and simpler in deployment; WordPiece sits between BPE and Unigram in complexity but is the least compressive of the three.
  • Statistical Significance:

    • Comprehensive studies suggest no single tokenizer algorithm consistently and significantly outperforms the others across all tasks and languages. Performance differences are often task-dependent and minor among leading methods at standard vocabulary sizes [3][4].
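
Fertility is simply the average number of tokens a tokenizer emits per unit of input (per word in most papers, per instruction in the benchmark cited above); a minimal helper, assuming any callable that maps a string to a list of tokens:

```python
def fertility(tokenize, texts):
    """Average tokens per whitespace-separated word; lower means better compression."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Works with any trained BPE/WordPiece/Unigram tokenizer; a character splitter
# stands in here just to show the interface.
print(fertility(lambda t: list(t.replace(" ", "")), ["read the code"]))  # 11 / 3 ≈ 3.67
```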

Summary Table

| Metric | BPE | WordPiece | Unigram |
| --- | --- | --- | --- |
| Tokens per input (fertility) | Medium (2.5–3) | High (4.5, less efficient) | Low (~2, most efficient) |
| Accuracy | Highest or near-best | Slightly below BPE | Mixed: sometimes lowest, but best compression |
| Flexibility | Good | Moderate | Highest (sampling, robust) |
| Best for | General NLP, LLMs | Tasks needing token granularity | Compression, multilingual, robust LLMs |

Takeaway:

  • BPE provides a strong trade-off between accuracy and efficiency, making it a default choice in many LLMs.
  • Unigram is best for maximum compression and robust tokenization, particularly in multilingual or complex input scenarios.
  • WordPiece offers a balanced solution, with slightly more granularity, useful for certain research or legacy models [1][2][3][4].

Summary

  • All three are used for modern LLMs and transformers (e.g., BPE in GPT, WordPiece in BERT, Unigram in T5/XLNet).
  • Unigram is most flexible but computationally heavier.
  • BPE is fastest and simplest.
  • WordPiece is similar to BPE but uses a likelihood-based merge criterion, with some subtle implementation differences.

Footnotes

  1. https://www.ndss-symposium.org/wp-content/uploads/bar2025-final13.pdf
  2. https://arxiv.org/html/2502.20273v2
  3. https://training.continuumlabs.ai/training/the-fine-tuning-process/tokenization/tokenization-is-more-than-compression
  4. https://blog.octanove.org/guide-to-subword-tokenization/