Introduction
Whole Word Masking (WWM) is an improvement over the original BERT-style masking. It addresses a key limitation of the original approach: because masking operates on individual subword tokens, a word that the tokenizer splits into several pieces can be only partially masked, leaving the model clues from the visible pieces.
How Whole Word Masking Works
In Whole Word Masking:
The text is tokenized into subword tokens (e.g. with WordPiece), and the tokens are grouped back into the whole words they came from.
Roughly 15% of the words are randomly selected for masking.
For each selected word, all of its subword tokens are masked together.
The masking scheme (80% [MASK], 10% random token, 10% unchanged) is applied to the selected word as a whole, not to its subword tokens independently. A minimal code sketch of this procedure follows this list.
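The sketch below illustrates the procedure described above. The helper names (`group_into_words`, `whole_word_mask`) and the single 80/10/10 decision per selected word are illustrative assumptions matching this description, not the reference BERT implementation (some implementations make the 80/10/10 decision per token instead).

```python
import random

def group_into_words(tokens):
    """Group WordPiece tokens into whole words: a token starting with
    '##' continues the previous word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)   # continuation of the previous word
        else:
            words.append([i])     # start of a new word
    return words                  # list of token-index groups, one per word

def whole_word_mask(tokens, vocab, mask_prob=0.15, seed=None):
    """Select ~15% of the whole words, then apply the 80% [MASK] /
    10% random / 10% unchanged rule to every subword token of each
    selected word (one decision per word, as described above)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    labels = [None] * len(tokens)         # original tokens at masked positions
    for word in group_into_words(tokens):
        if rng.random() >= mask_prob:
            continue                      # word not selected for masking
        roll = rng.random()               # one 80/10/10 decision for the whole word
        for i in word:
            labels[i] = tokens[i]
            if roll < 0.8:
                tokens[i] = "[MASK]"
            elif roll < 0.9:
                tokens[i] = rng.choice(vocab)
            # else: leave the token unchanged (remaining 10%)
    return tokens, labels

# Example usage (vocabulary truncated for illustration):
masked, labels = whole_word_mask(
    ["The", "quick", "bro", "##wn", "fox"], vocab=["cat", "tree", "run"], seed=0)
```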
Example
Let's consider the sentence: "The quick brown fox jumps over the lazy dog."
Tokenization
With WordPiece tokenization (used by BERT), this might be tokenized as: ["The", "quick", "bro", "##wn", "fox", "jump", "##s", "over", "the", "lazy", "dog"]
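The split shown above is purely illustrative; with the standard bert-base-uncased vocabulary, these common words are usually kept whole. To inspect real WordPiece output, a quick check with the Hugging Face transformers tokenizer (assumed to be installed) looks like this:

```python
from transformers import BertTokenizer

# Load a standard WordPiece tokenizer (downloads or reads cached vocab files).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# With this vocabulary, the words in this sentence are typically kept whole;
# rarer words or misspellings are the ones split into '##' subword pieces.
```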
Masking Process
Assuming "brown" and "jumps" are selected for masking:
| Method | Masked Sentence |
| --- | --- |
| BERT-style | The quick [MASK] ##wn fox [MASK] ##s over the lazy dog |
| Whole Word | The quick [MASK] [MASK] fox [MASK] [MASK] over the lazy dog |
As you can see, Whole Word Masking ensures that all subword tokens of a selected word are masked together.
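The short self-contained sketch below reproduces the two table rows. The forced token selections (and the helper `apply_mask`) are illustrative assumptions chosen to match this example, not output of an actual masking run.

```python
tokens = ["The", "quick", "bro", "##wn", "fox", "jump", "##s",
          "over", "the", "lazy", "dog"]

# Token indices selected for masking (chosen to match the example above):
# BERT-style masking may happen to pick only "bro" and "jump";
# whole-word masking expands the selection to every piece of "brown" and "jumps".
bert_style_positions = {2, 5}
whole_word_positions = {2, 3, 5, 6}

def apply_mask(tokens, positions):
    """Replace the tokens at the given positions with [MASK]."""
    return " ".join("[MASK]" if i in positions else t
                    for i, t in enumerate(tokens))

print("BERT-style:", apply_mask(tokens, bert_style_positions))
print("Whole Word:", apply_mask(tokens, whole_word_positions))
# BERT-style: The quick [MASK] ##wn fox [MASK] ##s over the lazy dog
# Whole Word: The quick [MASK] [MASK] fox [MASK] [MASK] over the lazy dog
```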
Key Differences Between BERT-style and Whole Word Masking
| Aspect | BERT-style Masking | Whole Word Masking |
| --- | --- | --- |
| Unit of masking | Individual tokens (including subwords) | Entire words |
| Handling of subwords | May mask only part of a word | Always masks all subwords of a chosen word |
| Consistency | Less consistent for multi-token words | More consistent across multi-token words |
| Context preservation | May leave partial word information visible | Completely hides the chosen words |
| Learning objective | Predict individual masked tokens | Predict all subword tokens of the masked words |
| Suitability for compounds | Less effective for compound words | More effective for compound words |
| Language sensitivity | Less sensitive to word boundaries | More sensitive to word boundaries |
Pros & Cons of Whole Word Masking
Pros:
Improved Context Understanding:
Whole Word Masking (WWM) improves the model's ability to understand the context of a word by masking entire words instead of subword tokens. This leads to better semantic understanding, as the model learns to predict the entire word based on its surrounding context.
Better Handling of Rare Words:
In traditional token masking, a rare word split into subwords may be only partially masked, so the model can guess the hidden piece from the visible ones rather than from context. WWM masks the whole word, forcing the model to learn a contextual representation for it, which leads to better handling of rare and out-of-vocabulary words.
Enhanced Coherence in Predictions:
WWM often results in more coherent predictions, as the model focuses on predicting meaningful whole words rather than fragmented subword tokens. This improves the quality of generated text and the accuracy of downstream tasks like named entity recognition and question answering.
More Realistic Training Data:
By masking entire words, the training data becomes more representative of real-world scenarios where words are the basic units of language. This leads to better generalization and performance on various NLP tasks.
Cons:
Increased Training Complexity:
WWM increases the complexity of the training process, as the model has to predict entire words rather than simpler subword units. This can lead to longer training times and the need for more computational resources.
Potential Data Sparsity:
Masking entire words can lead to data sparsity issues, especially for low-frequency words. The model may encounter fewer instances of these words during training, which can affect its ability to learn robust representations for rare words.
Reduced Flexibility with Subwords:
Traditional subword tokenization methods, like WordPiece, offer flexibility in handling words with different prefixes, suffixes, and compound forms. WWM loses some of this flexibility, which can be a disadvantage for languages with complex morphology or for tasks that benefit from finer-grained tokenization.
Masking Efficiency:
Because every subword token of a selected word must be masked together, WWM tends to mask longer contiguous spans of the input. Under a fixed masking budget, this means fewer distinct words are masked per example, and more context is removed around each one, which can slow the model's convergence during training.
Summary
Whole Word Masking offers significant advantages in terms of contextual understanding, handling rare words, and generating coherent predictions, making it a valuable technique for improving the performance of Masked Language Models. However, it also introduces challenges related to training complexity, data sparsity, and reduced flexibility in tokenization. Careful consideration of these pros and cons is essential when implementing WWM in practical NLP applications.