Introduction
Whole Word Masking (WWM) is an improvement over the original BERT-style masking. It addresses a key limitation of the original approach: because masking operates on individual subword tokens, a word that the tokenizer splits into several pieces can be only partially masked, leaving the model clues from the visible pieces.
How Whole Word Masking Works
In Whole Word Masking:
The text is tokenized into subword tokens (e.g. with WordPiece), and the tokens are grouped back into the whole words they came from.
Roughly 15% of the words are randomly selected for masking.
For each selected word, all of its subword tokens are masked together.
The masking scheme (80% [MASK], 10% random token, 10% unchanged) is applied to the selected word as a whole, not to its subword tokens independently. A minimal code sketch of this procedure follows this list.
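The sketch below illustrates the procedure described above. The helper names (`group_into_words`, `whole_word_mask`) and the single 80/10/10 decision per selected word are illustrative assumptions matching this description, not the reference BERT implementation (some implementations make the 80/10/10 decision per token instead).

```python
import random

def group_into_words(tokens):
    """Group WordPiece tokens into whole words: a token starting with
    '##' continues the previous word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)   # continuation of the previous word
        else:
            words.append([i])     # start of a new word
    return words                  # list of token-index groups, one per word

def whole_word_mask(tokens, vocab, mask_prob=0.15, seed=None):
    """Select ~15% of the whole words, then apply the 80% [MASK] /
    10% random / 10% unchanged rule to every subword token of each
    selected word (one decision per word, as described above)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    labels = [None] * len(tokens)         # original tokens at masked positions
    for word in group_into_words(tokens):
        if rng.random() >= mask_prob:
            continue                      # word not selected for masking
        roll = rng.random()               # one 80/10/10 decision for the whole word
        for i in word:
            labels[i] = tokens[i]
            if roll < 0.8:
                tokens[i] = "[MASK]"
            elif roll < 0.9:
                tokens[i] = rng.choice(vocab)
            # else: leave the token unchanged (remaining 10%)
    return tokens, labels

# Example usage (vocabulary truncated for illustration):
masked, labels = whole_word_mask(
    ["The", "quick", "bro", "##wn", "fox"], vocab=["cat", "tree", "run"], seed=0)
```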
Example
Let's consider the sentence: "The quick brown fox jumps over the lazy dog."
Tokenization
With WordPiece tokenization (used by BERT), this might be tokenized as: ["The", "quick", "bro", "##wn", "fox", "jump", "##s", "over", "the", "lazy", "dog"]
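The split shown above is purely illustrative; with the standard bert-base-uncased vocabulary, these common words are usually kept whole. To inspect real WordPiece output, a quick check with the Hugging Face transformers tokenizer (assumed to be installed) looks like this:

```python
from transformers import BertTokenizer

# Load a standard WordPiece tokenizer (downloads or reads cached vocab files).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# With this vocabulary, the words in this sentence are typically kept whole;
# rarer words or misspellings are the ones split into '##' subword pieces.
```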
Masking Process
Assuming "brown" and "jumps" are selected for masking:
| Method | Masked Sentence |
| --- | --- |
| BERT-style | The quick [MASK] ##wn fox [MASK] ##s over the lazy dog |
| Whole Word | The quick [MASK] [MASK] fox [MASK] [MASK] over the lazy dog |
As you can see, Whole Word Masking ensures that all subword tokens of a selected word are masked together.
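The short self-contained sketch below reproduces the two table rows. The forced token selections (and the helper `apply_mask`) are illustrative assumptions chosen to match this example, not output of an actual masking run.

```python
tokens = ["The", "quick", "bro", "##wn", "fox", "jump", "##s",
          "over", "the", "lazy", "dog"]

# Token indices selected for masking (chosen to match the example above):
# BERT-style masking may happen to pick only "bro" and "jump";
# whole-word masking expands the selection to every piece of "brown" and "jumps".
bert_style_positions = {2, 5}
whole_word_positions = {2, 3, 5, 6}

def apply_mask(tokens, positions):
    """Replace the tokens at the given positions with [MASK]."""
    return " ".join("[MASK]" if i in positions else t
                    for i, t in enumerate(tokens))

print("BERT-style:", apply_mask(tokens, bert_style_positions))
print("Whole Word:", apply_mask(tokens, whole_word_positions))
# BERT-style: The quick [MASK] ##wn fox [MASK] ##s over the lazy dog
# Whole Word: The quick [MASK] [MASK] fox [MASK] [MASK] over the lazy dog
```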
Key Differences Between BERT-style and Whole Word Masking
| Aspect | BERT-style Masking | Whole Word Masking |
| --- | --- | --- |
| Unit of masking | Individual tokens (including subwords) | Entire words |
| Handling of subwords | May mask only part of a word | Always masks all subwords of a chosen word |
| Consistency | Less consistent for multi-token words | More consistent across multi-token words |
| Context preservation | May leave partial word information visible | Completely hides the chosen words |
| Learning objective | Predict individual masked tokens | Predict all subword tokens of the masked words |
| Suitability for compounds | Less effective for compound words | More effective for compound words |
| Language sensitivity | Less sensitive to word boundaries | More sensitive to word boundaries |
Pros & Cons of Whole Word Masking
Pros:
Improved Context Understanding:
Whole Word Masking (WWM) improves the model's ability to understand the context of a word by masking entire words instead of subword tokens. This leads to better semantic understanding, as the model learns to predict the entire word based on its surrounding context.
Better Handling of Rare Words:
In traditional token masking, a rare word split into subwords may be only partially masked, so the model can guess the hidden piece from the visible ones rather than from context. WWM masks the whole word, forcing the model to learn a contextual representation for it, which leads to better handling of rare and out-of-vocabulary words.
Enhanced Coherence in Predictions:
WWM often results in more coherent predictions, as the model focuses on predicting meaningful whole words rather than fragmented subword tokens. This improves the quality of generated text and the accuracy of downstream tasks like named entity recognition and question answering.
More Realistic Training Data:
By masking entire words, the training data becomes more representative of real-world scenarios where words are the basic units of language. This leads to better generalization and performance on various NLP tasks.
Cons:
Increased Training Complexity:
WWM increases the complexity of the training process, as the model has to predict entire words rather than simpler subword units. This can lead to longer training times and the need for more computational resources.
Potential Data Sparsity:
Masking entire words can lead to data sparsity issues, especially for low-frequency words. The model may encounter fewer instances of these words during training, which can affect its ability to learn robust representations for rare words.
Reduced Flexibility with Subwords:
Traditional subword tokenization methods, like WordPiece, offer flexibility in handling words with different prefixes, suffixes, and compound forms. WWM loses some of this flexibility, which can be a disadvantage for languages with complex morphology or for tasks that benefit from finer-grained tokenization.
Masking Efficiency:
Because every subword token of a selected word must be masked together, WWM tends to mask longer contiguous spans of the input. Under a fixed masking budget, this means fewer distinct words are masked per example, and more context is removed around each one, which can slow the model's convergence during training.
Summary
Whole Word Masking offers significant advantages in terms of contextual understanding, handling rare words, and generating coherent predictions, making it a valuable technique for improving the performance of Masked Language Models. However, it also introduces challenges related to training complexity, data sparsity, and reduced flexibility in tokenization. Careful consideration of these pros and cons is essential when implementing WWM in practical NLP applications.