Ever wondered how chatbots understand your typos or how translation apps handle words they've never seen before? Enter Byte Pair Encoding (BPE), the unsung hero of modern Natural Language Processing (NLP).
Why Subword Tokenization?
Imagine trying to learn a language by memorizing entire sentences. Sounds tedious, right? That's essentially what traditional word-based tokenization does to our poor AI models. They struggle with new words, get overwhelmed by complex languages, and their vocabularies balloon to ridiculous sizes.
Subword tokenization, like BPE, is the language-learning hack we've been waiting for. It's like teaching our models to recognize prefixes, suffixes, and root words, making them more flexible and efficient.
BPE: From Data Compression to Language Ninja
BPE started its life compressing data back in the '90s. Fast forward to 2015, and some clever folks realized it could revolutionize NLP. Here's the gist:
Start with individual characters.
Find the most common pair of adjacent characters.
Merge them into a new token.
Repeat until you've got a manageable vocabulary.
Byte Pair Encoding: A Concise, Real-World Example
Let's walk through the Byte Pair Encoding (BPE) process using a simple sentence:
"The fox jumped over the fence!"
Initial Setup
Step 1: Initialize Vocabulary
Create a vocabulary of individual characters from the input text. Â Initial vocabulary:Â ['T', 'h', 'e', ' ', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 'n', 'c', '!']
Step 2: Tokenize Text
Represent the text as a sequence of these character tokens. Encoded text: T h e f o x j u m p e d o v e r t h e f e n c e !
Total tokens:Â 17
BPE Algorithm Iterations
Iteration 1
Step 3: Count Pairs
Count the frequency of every adjacent pair of tokens in the text.
Th: 1, he: 2, e : 2, f: 2, fo: 1, ox: 1, x : 1, j: 1, ju: 1, um: 1,
mp: 1, pe: 1, ed: 1, d : 1, o: 1, ov: 1, ve: 1, er: 1, r : 1, t: 1,
th: 1, f: 1, fe: 1, en: 1, nc: 1, ce: 1, e!: 1
Step 4: Find Most Frequent Pair
Identify the most frequent pair of tokens. Merge most frequent:Â 'he' and 'e 'Â (tied, we'll choose 'he')
Step 5: Merge Pair
 Create a new token by merging this pair and add it to the vocabulary. New token: 'he'
Step 6: Update Text
Replace all occurrences of the merged pair in the text with the new token. Updated vocabulary: ['T', 'h', 'e', ' ', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 'n', 'c', '!', 'he'] Encoded text: T he f o x j u m p e d o v e r t he f e n c e ! Total tokens: 18
Iteration 2
Step 3: Count Pairs (Repeat)
The: 1, he : 2, f: 2, fo: 1, ox: 1, x : 1, j: 1, ju: 1, um: 1,
mp: 1, pe: 1, ed: 1, d : 1, o: 1, ov: 1, ve: 1, er: 1, r : 1,
th: 1, f: 1, fe: 1, en: 1, nc: 1, ce: 1, e!: 1
Step 4 &5: Find & Merge most frequent pair
'he '
Step 6: Update TextÂ
Updated vocabulary: ['T', 'h', 'e', ' ', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 'n', 'c', '!', 'he', 'he ']Â Encoded text:Â T he f o x j u m p e d o v e r t he f e n c e ! Total tokens:Â 19
Step 7
Repeat process until the vocab size is reached or no pair appears more than once.
Final vocabulary: ['T', 'h', 'e', ' ', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 'n', 'c', '!', 'he', 'he ', 'he f'] Final encoded text: T he fox j u m p e d o v e r t he fence ! Total tokens: 20
Key Takeaways
We started with 17 character tokens and ended with 20 tokens, including subwords and partial words.
The algorithm efficiently captured frequent bigrams like "he" and "he ".
The word "the" was partially captured as "he " and "T he".
Less frequent words ('jumped', 'over') remained as individual characters.
The algorithm stopped when no pair occurred more than once, preventing overfitting to this small sample.
Handling Unseen Tokens/Words
One of the key advantages of Byte Pair Encoding is its ability to handle unseen words by breaking them down into known subword units. Let's demonstrate this with an example using our trained BPE model.
Suppose we encounter a new sentence:
"The fox leaped over the barrier!"
Here's how our BPE model would tokenize this sentence:
"The" → T heÂ
Recognized as a known combination
"fox" → foxÂ
Recognized as a whole word (merged in Iteration 3)
"leaped" → l e a p e dÂ
Unseen word, broken down into individual characters
"over" → o v e rÂ
Unseen word, broken down into individual characters
"the" → t heÂ
Recognized as a known combination
"barrier" → b a r r i e rÂ
Unseen word, broken down into individual characters
"!" → !Â
Recognized as an individual token
Final tokenization: T he fox l e a p e d o v e r t he b a r r i e r !
Key Observations:
Known words/subwords:Â "The", "fox", and "the" are tokenized using the learned merges.
Unseen words: "leaped", "over", and "barrier" are broken down into individual characters.
Partial matches: Even though "over" appeared in our original sentence, our limited training didn't create a merged token for it, so it's treated as a new word.
Flexibility: This approach allows the model to handle any word, even if it's never seen before, by falling back to character-level encoding.
Efficiency for known parts:Â Common parts of words (like "he" in "the") are still efficiently encoded.
Byte Pair Encoding Flowchart
Scaling Byte Pair Encoding to a Multi-Sentence Corpus
Approach
Separate Processing: Each sentence is typically treated separately for pair counting.
Sentence Delimiters: Often, special tokens are used to mark the beginning and end of sentences.
No Cross-Sentence Merges: Pairs are not counted across sentence boundaries.
Let's assume that the following sentences represent our corpus
"The quick brown fox jumps over the lazy dog."
"A quick brown dog jumps over the lazy fox."
"The lazy fox quickly jumps over the brown dog."
Initial Setup
First, we'll add special tokens for sentence boundaries: <s>Â for start and </s>Â for end.
Initial vocabulary ['<s>', '</s>', 'T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'x', 'j', 'm', 'p', 's', 'v', 'l', 'a', 'z', 'y', 'd', 'g', '.', 'A']
Encoded corpus
<s> T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g . </s>
<s> A q u i c k b r o w n d o g j u m p s o v e r t h e l a z y f o x . </s>
<s> T h e l a z y f o x q u i c k l y j u m p s o v e r t h e b r o w n d o g . </s>
BPE Iteration
Now, when we count pairs, we do so within each sentence separately:
Sentence 1 pairs
<s>T: 1, Th: 1, he: 1, e : 1, q: 1, qu: 1, ui: 1, ic: 1, ck: 1, k : 1, b: 1, br: 1, ro: 1, ow: 1, wn: 1, n : 1, f: 1, fo: 1, ox: 1, x : 1, j: 1, ju: 1, um: 1, mp: 1, ps: 1, s : 1, o: 1, ov: 1, ve: 1, er: 1, r : 1, t: 1, th: 1, l: 1, la: 1, az: 1, zy: 1, y : 1, d: 1, do: 1, og: 1, g.: 1, .</s>: 1
Sentence 2 pairs
<s>A: 1, A : 1, q: 1, qu: 1, ui: 1, ic: 1, ck: 1, k : 1, b: 1, br: 1, ro: 1, ow: 1, wn: 1, n : 1, d: 1, do: 1, og: 1, g : 1, j: 1, ju: 1, um: 1, mp: 1, ps: 1, s : 1, o: 1, ov: 1, ve: 1, er: 1, r : 1, t: 1, th: 1, he: 1, e : 1, l: 1, la: 1, az: 1, zy: 1, y : 1, f: 1, fo: 1, ox: 1, x.: 1, .</s>: 1
Sentence 3 pairs
<s>T: 1, Th: 1, he: 1, e : 1, l: 1, la: 1, az: 1, zy: 1, y : 1, f: 1, fo: 1, ox: 1, x : 1, q: 1, qu: 1, ui: 1, ic: 1, ck: 1, kl: 1, ly: 1, y : 1, j: 1, ju: 1, um: 1, mp: 1, ps: 1, s : 1, o: 1, ov: 1, ve: 1, er: 1, r : 1, t: 1, th: 1, he: 1, e : 1, b: 1, br: 1, ro: 1, ow: 1, wn: 1, n : 1, d: 1, do: 1, og: 1, g.: 1, .</s>: 1
Merging Decision
To decide which pair to merge, we sum the frequencies across all sentences, but importantly, we don't create pairs across sentence boundaries.
Combined frequencies (top pairs):
he: 3, e : 3 ,qu: 3, ui: 3, ic: 3, ck: 3, ...
In this case, we have multiple pairs tied with the highest frequency of 3. Let's say we choose to merge 'he'.
Update
We would then update each sentence separately with the new merged token 'he'.
Why This Approach Matters
Meaningful Tokens: By not merging across sentence boundaries, we avoid creating tokens that don't represent meaningful linguistic units.
Sentence Integrity: This approach maintains the integrity of individual sentences, which is often important for downstream NLP tasks.
Flexibility: It allows for processing of streaming text or very large corpora where not all sentences are available simultaneously.
Consistent Tokenization: Ensures that the same sentence will be tokenized the same way regardless of surrounding context.
Challenges of Standard BPE
Greedy and Deterministic: BPE always chooses the most frequent pair, which may not always be optimal for downstream tasks.
Out-of-Vocabulary (OOV) Handling: While better than word-level tokenization, BPE can still struggle with rare words or unique spellings.
Lack of Morphological Awareness: BPE doesn't explicitly consider linguistic structure, potentially leading to suboptimal subword splits.
Fixed Vocabulary: Once trained, the BPE vocabulary is static, which can be limiting for evolving language or domain adaptation.
Computational Complexity: For large corpora, the iterative merging process can be time-consuming.
Segmentation Ambiguity: The same word might be tokenized differently depending on context, potentially losing consistency.
BPE Variants and Their Improvements
WordPiece
Improvement: Uses a likelihood-based approach instead of frequency.
Benefit: Often results in more linguistically meaningful subword units.
SentencePiece
Improvement: Treats the input as a sequence of unicode characters and can use either BPE or Unigram LM.
Benefit: Language-agnostic and handles any unicode characters without pre-tokenization.
Unigram Language ModelÂ
Improvement: Uses a probabilistic model to find the optimal segmentation.
Benefit: Can consider multiple possible segmentations, potentially leading to better tokenization.
BPE-Dropout
Improvement: Introduces randomness in the segmentation process during training.
Benefit: Improves model robustness and generalization to unseen words.
Dynamic BPE
Improvement: Allows for vocabulary adaptation during fine-tuning or inference.
Benefit: Better handles domain shifts or evolving language use.
Byte-Level BPE
Improvement: Operates on bytes instead of unicode characters.
Benefit: Guarantees a fixed-size vocabulary and can represent any string of bytes.