
Byte Pair Encoding: Cracking the Subword Code

Ever wondered how chatbots understand your typos or how translation apps handle words they've never seen before? Enter Byte Pair Encoding (BPE), the unsung hero of modern Natural Language Processing (NLP).


Why Subword Tokenization?

Imagine trying to learn a language by memorizing entire sentences. Sounds tedious, right? That's essentially what traditional word-based tokenization does to our poor AI models. They struggle with new words, get overwhelmed by complex languages, and their vocabularies balloon to ridiculous sizes.


Subword tokenization, like BPE, is the language-learning hack we've been waiting for. It's like teaching our models to recognize prefixes, suffixes, and root words, making them more flexible and efficient.


BPE: From Data Compression to Language Ninja

BPE started its life compressing data back in the '90s. Fast forward to 2015, and some clever folks realized it could revolutionize NLP. Here's the gist:

  1. Start with individual characters.

  2. Find the most common pair of adjacent characters.

  3. Merge them into a new token.

  4. Repeat until you've got a manageable vocabulary.
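
To make these four steps concrete, here's a minimal training loop in plain Python. It's a sketch, not any particular library's implementation: the function names (get_pair_counts, merge_pair, train_bpe) and the tie-breaking rule (the earliest-seen pair wins) are my own choices.

from collections import Counter

def get_pair_counts(tokens):
    # Step 2: count every adjacent pair of tokens.
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    # Steps 3-4: replace every occurrence of `pair` with one merged token.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, max_vocab_size=50):
    tokens = list(text)                      # step 1: individual characters
    vocab = sorted(set(tokens))
    merges = []                              # learned merges, in order
    while len(vocab) < max_vocab_size:
        counts = get_pair_counts(tokens)
        if not counts:
            break
        best = max(counts, key=counts.get)   # most frequent pair; earliest seen wins ties
        if counts[best] < 2:                 # stop once no pair repeats
            break
        tokens = merge_pair(tokens, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges, tokens

The merges list is what you keep: applying those merges, in order, to new text is all a BPE tokenizer does at encoding time.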


Byte Pair Encoding: A Concise, Real-World Example

Let's walk through the Byte Pair Encoding (BPE) process using a simple sentence:


"The fox jumped over the fence!"


Initial Setup


Step 1: Initialize Vocabulary

Create a vocabulary of the individual characters that appear in the input text. Initial vocabulary (18 characters; '_' stands for the space character): ['T', 'h', 'e', '_', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 't', 'n', 'c', '!']


Step 2: Tokenize Text

Represent the text as a sequence of these character tokens (spaces are tokens too, shown as '_').  Encoded text: T h e _ f o x _ j u m p e d _ o v e r _ t h e _ f e n c e !

Total tokens: 30
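
You can check these numbers with a couple of lines of Python (the variable names are just for illustration):

text = "The fox jumped over the fence!"
tokens = list(text)            # Step 2: one token per character (spaces included)
vocab = sorted(set(tokens))    # Step 1: the initial vocabulary

print(len(vocab))   # 18 unique characters
print(len(tokens))  # 30 character tokens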


BPE Algorithm Iterations


Iteration 1

Step 3: Count Pairs

Count the frequency of every adjacent pair of tokens in the text.

Th: 1, he: 2, e_: 2, _f: 2, fo: 1, ox: 1, x_: 1, _j: 1, ju: 1, um: 1,
mp: 1, pe: 1, ed: 1, d_: 1, _o: 1, ov: 1, ve: 1, er: 1, r_: 1, _t: 1,
th: 1, fe: 1, en: 1, nc: 1, ce: 1, e!: 1
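
In Python this count is one line with collections.Counter; here is a self-contained check of the table above (my own variable names):

from collections import Counter

tokens = list("The fox jumped over the fence!")
pair_counts = Counter(zip(tokens, tokens[1:]))

# Only three pairs occur more than once:
print([(a + b, n) for (a, b), n in pair_counts.items() if n > 1])
# [('he', 2), ('e ', 2), (' f', 2)]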


Step 4: Find Most Frequent Pair

Identify the most frequent pair of tokens. Most frequent: 'he', 'e_' and '_f' are tied at 2; we'll choose 'he'.

Step 5: Merge Pair

  Create a new token by merging this pair and add it to the vocabulary. New token: 'he'

Step 6: Update Text

Replace all occurrences of the merged pair in the text with the new token. Updated vocabulary: ['T', 'h', 'e', '_', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 't', 'n', 'c', '!', 'he']  Encoded text: T he _ f o x _ j u m p e d _ o v e r _ t he _ f e n c e !  Total tokens: 28 (the two merges each removed one token)
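
Steps 5 and 6 boil down to a single pass over the token list; here is the merge helper on its own, applied to this sentence (a sketch with assumed names, matching the loop shown earlier):

def merge_pair(tokens, pair):
    # Replace every adjacent occurrence of `pair` with one merged token.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = merge_pair(list("The fox jumped over the fence!"), ('h', 'e'))
print(len(tokens))  # 28 tokens: the two ('h', 'e') pairs collapsed into 'he'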


Iteration 2

Step 3: Count Pairs (Repeat)

Counting pairs again after the merge (pairs are now written as left+right, since a token can span more than one character):

T+he: 1, he+_: 2, _+f: 2, f+o: 1, o+x: 1, x+_: 1, _+j: 1, j+u: 1, u+m: 1, m+p: 1,
p+e: 1, e+d: 1, d+_: 1, _+o: 1, o+v: 1, v+e: 1, e+r: 1, r+_: 1, _+t: 1, t+he: 1,
f+e: 1, e+n: 1, n+c: 1, c+e: 1, e+!: 1


Step 4 & 5: Find & Merge Most Frequent Pair

Most frequent: 'he'+'_' (tied with '_'+'f' at 2; we'll choose it). New token: 'he_'


Step 6: Update Text 

Updated vocabulary: ['T', 'h', 'e', '_', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 't', 'n', 'c', '!', 'he', 'he_']  Encoded text: T he_ f o x _ j u m p e d _ o v e r _ t he_ f e n c e !  Total tokens: 26


Step 7

Repeat the process until the target vocabulary size is reached or no pair appears more than once. The next (and last) merge is 'he_' + 'f' → 'he_f', which also occurs twice; after that every pair occurs only once, so training stops.

Final vocabulary: ['T', 'h', 'e', '_', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 'd', 'v', 'r', 't', 'n', 'c', '!', 'he', 'he_', 'he_f']  Final encoded text: T he_f o x _ j u m p e d _ o v e r _ t he_f e n c e !  Total tokens: 24
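
Running the greedy loop to completion on this sentence reproduces exactly these merges; here is a condensed, self-contained version of the earlier sketch you can paste into a Python session (the learned tokens contain real spaces, which the walkthrough above displays as '_'):

from collections import Counter

def merge_pair(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

tokens, merges = list("The fox jumped over the fence!"), []
while True:
    counts = Counter(zip(tokens, tokens[1:]))
    best = max(counts, key=counts.get)       # earliest-seen pair wins ties
    if counts[best] < 2:                     # no pair repeats any more: stop
        break
    tokens = merge_pair(tokens, best)
    merges.append(best[0] + best[1])

print(merges)       # ['he', 'he ', 'he f']
print(len(tokens))  # 24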


Key Takeaways

  1. We started with a vocabulary of 18 single characters and finished with 21 entries, while the encoded sentence shrank from 30 tokens to 24.

  2. The algorithm efficiently captured the frequent pairs 'he', 'he_' and 'he_f'.

  3. The word "the" was never learned as a whole token, but its frequent core was: "The fox" encodes as 'T' + 'he_f' + 'o' + 'x', and "the fence" as 't' + 'he_f' + 'e' + 'n' + 'c' + 'e'.

  4. Less frequent words ('jumped', 'over') remained as sequences of individual characters.

  5. The algorithm stopped when no pair occurred more than once, preventing overfitting to this small sample.


Handling Unseen Tokens/Words

One of the key advantages of Byte Pair Encoding is its ability to handle unseen words by breaking them down into known subword units. Let's demonstrate this with an example using our trained BPE model.

Suppose we encounter a new sentence:

"The fox leaped over the barrier!"

Here's how our BPE model, with its three learned merges ('he', 'he_', 'he_f'), would tokenize this sentence:

  1. "The" → T he 

    1. Recognized as a known combination

  2. "fox" → fox 

    1. Recognized as a whole word (merged in Iteration 3)

  3. "leaped" → l e a p e d 

    1. Unseen word, broken down into individual characters

  4. "over" → o v e r 

    1. Unseen word, broken down into individual characters

  5. "the" → t he 

    1. Recognized as a known combination

  6. "barrier" → b a r r i e r 

    1. Unseen word, broken down into individual characters

  7. "!" → ! 

    1. Recognized as an individual token


Final tokenization: T he fox l e a p e d o v e r t he b a r r i e r !
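
Encoding new text is just a replay of the learned merges, in the order they were learned, on top of a character-level fallback. Here is a minimal sketch (the merges list mirrors the walkthrough above; the helper names are my own):

def merge_pair(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

# The three merges learned from "The fox jumped over the fence!", in order.
merges = [('h', 'e'), ('he', ' '), ('he ', 'f')]

def encode(text, merges):
    tokens = list(text)       # fall back to individual characters first
    for pair in merges:       # then replay each learned merge in order
        tokens = merge_pair(tokens, pair)
    return tokens

print(encode("The fox leaped over the barrier!", merges))
# ['T', 'he f', 'o', 'x', ' ', 'l', 'e', 'a', 'p', 'e', 'd', ' ',
#  'o', 'v', 'e', 'r', ' ', 't', 'he ', 'b', 'a', 'r', 'r', 'i', 'e', 'r', '!']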


Key Observations:

  1. Known subwords: the frequent pieces 'he', 'he_' and 'he_f' learned during training are reused wherever they fit, in "The fox" and "the".

  2. Unseen words: "leaped", "over", and "barrier" are broken down into individual characters.

  3. Partial matches: even though "over" appeared in our original sentence, our limited training didn't create a merged token for it, so it's still encoded character by character.

  4. Flexibility: this approach lets the model handle words it has never seen by falling back to character-level encoding, as long as every character is in the base vocabulary (characters like 'l', 'a', 'b' and 'i' were not, which is exactly the gap byte-level BPE closes).

  5. Efficiency for known parts: common fragments (like 'he' in "the") are still encoded compactly.


Byte Pair Encoding Flowchart

Scaling Byte Pair Encoding to a Multi-Sentence Corpus


Approach

  1. Separate Processing: Each sentence is typically treated separately for pair counting.

  2. Sentence Delimiters: Often, special tokens are used to mark the beginning and end of sentences.

  3. No Cross-Sentence Merges: Pairs are not counted across sentence boundaries.


Let's assume that the following sentences represent our corpus


  1. "The quick brown fox jumps over the lazy dog."

  2. "A quick brown dog jumps over the lazy fox."

  3. "The lazy fox quickly jumps over the brown dog."


Initial Setup

First, we'll add special tokens for sentence boundaries: <s> for start and </s> for end.


Initial vocabulary (32 entries; '_' again stands for the space character): ['<s>', '</s>', 'T', 'h', 'e', '_', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'x', 'j', 'm', 'p', 's', 'v', 't', 'l', 'a', 'z', 'y', 'd', 'g', '.', 'A']


Encoded corpus

<s> T h e _ q u i c k _ b r o w n _ f o x _ j u m p s _ o v e r _ t h e _ l a z y _ d o g . </s>

<s> A _ q u i c k _ b r o w n _ d o g _ j u m p s _ o v e r _ t h e _ l a z y _ f o x . </s>

<s> T h e _ l a z y _ f o x _ q u i c k l y _ j u m p s _ o v e r _ t h e _ b r o w n _ d o g . </s>


BPE Iteration

Now, when we count pairs, we do so within each sentence separately:


Sentence 1 pairs

<s>T: 1, Th: 1, he: 2, e_: 2, _q: 1, qu: 1, ui: 1, ic: 1, ck: 1, k_: 1, _b: 1, br: 1, ro: 1, ow: 1, wn: 1, n_: 1, _f: 1, fo: 1, ox: 1, x_: 1, _j: 1, ju: 1, um: 1, mp: 1, ps: 1, s_: 1, _o: 1, ov: 1, ve: 1, er: 1, r_: 1, _t: 1, th: 1, _l: 1, la: 1, az: 1, zy: 1, y_: 1, _d: 1, do: 1, og: 1, g.: 1, .</s>: 1


Sentence 2 pairs

<s>A: 1, A_: 1, _q: 1, qu: 1, ui: 1, ic: 1, ck: 1, k_: 1, _b: 1, br: 1, ro: 1, ow: 1, wn: 1, n_: 1, _d: 1, do: 1, og: 1, g_: 1, _j: 1, ju: 1, um: 1, mp: 1, ps: 1, s_: 1, _o: 1, ov: 1, ve: 1, er: 1, r_: 1, _t: 1, th: 1, he: 1, e_: 1, _l: 1, la: 1, az: 1, zy: 1, y_: 1, _f: 1, fo: 1, ox: 1, x.: 1, .</s>: 1


Sentence 3 pairs

<s>T: 1, Th: 1, he: 2, e_: 2, _l: 1, la: 1, az: 1, zy: 1, y_: 2, _f: 1, fo: 1, ox: 1, x_: 1, _q: 1, qu: 1, ui: 1, ic: 1, ck: 1, kl: 1, ly: 1, _j: 1, ju: 1, um: 1, mp: 1, ps: 1, s_: 1, _o: 1, ov: 1, ve: 1, er: 1, r_: 1, _t: 1, th: 1, _b: 1, br: 1, ro: 1, ow: 1, wn: 1, n_: 1, _d: 1, do: 1, og: 1, g.: 1, .</s>: 1


Merging Decision

To decide which pair to merge, we sum the frequencies across all sentences, but importantly, we don't create pairs across sentence boundaries.


Combined frequencies (top pairs):

he: 5, e_: 5, y_: 4, qu: 3, ui: 3, ic: 3, ck: 3, th: 3, fo: 3, ...


In this case, 'he' and 'e_' are tied for the highest combined frequency of 5. Let's say we choose to merge 'he'.


Update

We would then update each sentence separately with the new merged token 'he'.
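
One corpus-level iteration might look like the sketch below: each sentence is kept as its own token list, so pairs are counted per sentence and summed, and the winning merge is then applied to every sentence separately. The code and names are my own illustration of the approach, not a specific library.

from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown dog jumps over the lazy fox.",
    "The lazy fox quickly jumps over the brown dog.",
]

# Each sentence is its own token list, wrapped in boundary markers.
sentences = [["<s>"] + list(s) + ["</s>"] for s in corpus]

def count_pairs(sentences):
    counts = Counter()
    for tokens in sentences:              # pairs never span two sentences
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_pair(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

counts = count_pairs(sentences)
best = max(counts, key=counts.get)
print(best, counts[best])                 # ('h', 'e') 5

# Apply the winning merge to every sentence separately.
sentences = [merge_pair(tokens, best) for tokens in sentences]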


Why This Approach Matters

  1. Meaningful Tokens: By not merging across sentence boundaries, we avoid creating tokens that don't represent meaningful linguistic units.

  2. Sentence Integrity: This approach maintains the integrity of individual sentences, which is often important for downstream NLP tasks.

  3. Flexibility: It allows for processing of streaming text or very large corpora where not all sentences are available simultaneously.

  4. Consistent Tokenization: Ensures that the same sentence will be tokenized the same way regardless of surrounding context.


Challenges of Standard BPE

  1. Greedy and Deterministic: BPE always chooses the most frequent pair, which may not always be optimal for downstream tasks.

  2. Out-of-Vocabulary (OOV) Handling: While better than word-level tokenization, BPE can still struggle with rare words or unique spellings.

  3. Lack of Morphological Awareness: BPE doesn't explicitly consider linguistic structure, potentially leading to suboptimal subword splits.

  4. Fixed Vocabulary: Once trained, the BPE vocabulary is static, which can be limiting for evolving language or domain adaptation.

  5. Computational Complexity: For large corpora, the iterative merging process can be time-consuming.

  6. Segmentation Ambiguity: The same word might be tokenized differently depending on context, potentially losing consistency.


BPE Variants and Their Improvements

  1. WordPiece

    1. Improvement: Uses a likelihood-based approach instead of frequency.

    2. Benefit: Often results in more linguistically meaningful subword units.

  2. SentencePiece

    1. Improvement: Treats the input as a sequence of unicode characters and can use either BPE or Unigram LM.

    2. Benefit: Language-agnostic and handles any unicode characters without pre-tokenization.

  3. Unigram Language Model 

    1. Improvement: Uses a probabilistic model to find the optimal segmentation.

    2. Benefit: Can consider multiple possible segmentations, potentially leading to better tokenization.

  4. BPE-Dropout

    1. Improvement: Introduces randomness in the segmentation process during training.

    2. Benefit: Improves model robustness and generalization to unseen words (a toy sketch follows this list).

  5. Dynamic BPE

    1. Improvement: Allows for vocabulary adaptation during fine-tuning or inference.

    2. Benefit: Better handles domain shifts or evolving language use.

  6. Byte-Level BPE

    1. Improvement: Operates on bytes instead of unicode characters.

    2. Benefit: Guarantees a fixed-size vocabulary and can represent any string of bytes.
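
Of these variants, BPE-Dropout is the easiest to illustrate on top of the plain encoder sketched earlier: during training-time encoding, each applicable merge is randomly skipped with probability p, so the model sees several different segmentations of the same text. The following is a toy illustration with my own names, not the paper's reference implementation:

import random

def merge_pair_with_dropout(tokens, pair, p, rng):
    out, i = [], 0
    while i < len(tokens):
        if (i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair
                and rng.random() >= p):       # apply the merge with probability 1 - p
            out.append(tokens[i] + tokens[i + 1]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

def encode_with_dropout(text, merges, p=0.1, rng=random):
    tokens = list(text)
    for pair in merges:                       # replay merges, each occurrence skippable
        tokens = merge_pair_with_dropout(tokens, pair, p, rng)
    return tokens

# Merges from the single-sentence walkthrough earlier.
merges = [('h', 'e'), ('he', ' '), ('he ', 'f')]

for _ in range(3):
    print(encode_with_dropout("the fence!", merges, p=0.5))
# Different runs give different segmentations, e.g.
# ['t', 'he f', 'e', 'n', 'c', 'e', '!'] or ['t', 'h', 'e', ' ', 'f', 'e', 'n', 'c', 'e', '!']

At inference time you would set p = 0 and recover ordinary, deterministic BPE.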




