Whole Word Masking

Introduction


Whole Word Masking (WWM) is a refinement of the original BERT masking strategy. It addresses a key limitation of that approach: when subword tokenization splits a word into several pieces, the original scheme can mask only some of those pieces, leaving the rest visible as clues.


How Whole Word Masking Works


In Whole Word Masking:


  • The text is run through the usual subword tokenizer, and the resulting tokens are grouped back into whole words.

  • Whole words are randomly selected for masking; following the original BERT recipe, the overall masking rate stays at roughly 15% of the tokens.

  • For each selected word, all of its subword tokens are masked together.

  • The 80% [MASK] / 10% random / 10% unchanged replacement scheme still applies; the difference is that the decision of what to mask is made per whole word rather than per individual subword token.

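To make the procedure concrete, here is a minimal sketch in Python. It is illustrative only: the function name whole_word_mask, its arguments, and the toy token list are invented for this example, and the 80/10/10 replacement is applied independently to each piece of a selected word, as in the reference BERT implementation.

import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]", vocab=None, seed=0):
    """Mask whole words in a list of WordPiece tokens ('##' marks continuation pieces)."""
    rng = random.Random(seed)
    vocab = vocab or list(tokens)  # stand-in vocabulary for the 10% "random token" case

    # Group token positions into whole words: a "##" piece belongs to the
    # word opened by the preceding piece.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Pick whole words until roughly `mask_rate` of the tokens are covered,
    # keeping the overall masking rate close to the usual 15%.
    budget = max(1, int(round(mask_rate * len(tokens))))
    rng.shuffle(words)
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered >= budget:
            break
        covered += len(word)
        # Apply the 80% [MASK] / 10% random / 10% unchanged rule to every
        # piece of the selected word.
        for i in word:
            p = rng.random()
            if p < 0.8:
                masked[i] = mask_token
            elif p < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return masked

tokens = ["deep", "learn", "##ing", "models", "pre", "##dict", "mask", "##ed", "words"]
print(whole_word_mask(tokens, seed=1))

In practice this logic rarely needs to be hand-written: Hugging Face Transformers, for example, ships a whole-word-masking data collator (DataCollatorForWholeWordMask) that plugs into its standard training pipeline.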

Example


Let's consider the sentence: "The quick brown fox jumps over the lazy dog."


Tokenization


With WordPiece tokenization (used by BERT), this might be tokenized as: ["The", "quick", "bro", "##wn", "fox", "jump", "##s", "over", "the", "lazy", "dog"]

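If you want to see how a real WordPiece vocabulary splits this sentence, the Hugging Face transformers library can be queried as in the short sketch below. Note that the exact split depends entirely on the model's vocabulary; with bert-base-uncased, common words such as "brown" and "jumps" usually stay whole, so the split shown above is purely illustrative.

# Requires the `transformers` package; output varies with the chosen vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The quick brown fox jumps over the lazy dog."))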

Masking Process


Assuming "brown" and "jumps" are the masking targets (for BERT-style masking, suppose that only the pieces "bro" and "jump" happen to be selected):

Method       Masked Sentence
BERT-style   The quick [MASK] ##wn fox [MASK] ##s over the lazy dog
Whole Word   The quick [MASK] [MASK] fox [MASK] [MASK] over the lazy dog

As you can see, Whole Word Masking ensures that all subword tokens of a selected word are masked together.
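
For contrast, a BERT-style baseline samples individual token positions with no regard for word boundaries, which is exactly what produces fragments such as "[MASK] ##wn". A minimal sketch, with invented names and the 80/10/10 replacement omitted for brevity:

import random

def token_level_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    # BERT-style masking: sample token positions independently, so a word
    # like "brown" can end up only partially masked ("[MASK] ##wn").
    rng = random.Random(seed)
    n = max(1, int(round(mask_rate * len(tokens))))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked

tokens = ["The", "quick", "bro", "##wn", "fox", "jump", "##s",
          "over", "the", "lazy", "dog"]
print(token_level_mask(tokens))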



Key Differences Between BERT-style and Whole Word Masking

Aspect                      BERT-style Masking                        Whole Word Masking
Unit of masking             Individual tokens (including subwords)    Entire words
Handling of subwords        May mask only part of a word              Always masks all subwords of a chosen word
Consistency                 Less consistent for multi-token words     More consistent for all words
Context preservation        May leave partial word information        Completely hides chosen words
Learning objective          Predict individual tokens                 Predict entire words
Suitability for compounds   Less effective for compound words         More effective for compound words
Language sensitivity        Less sensitive to word boundaries         More sensitive to word boundaries

Pros & Cons of Whole Word Masking


Pros:


  • Improved Context Understanding:

    • Whole Word Masking (WWM) improves the model's ability to understand the context of a word by masking entire words instead of subword tokens. This leads to better semantic understanding, as the model learns to predict the entire word based on its surrounding context.

  • Better Handling of Rare Words:

    • In traditional token masking, rare words are often split into several subwords, and masking only some of them gives the model an easy, fragment-level prediction task. WWM masks such words as a unit, pushing the model to infer the whole word from its context, which leads to better handling of rare and out-of-vocabulary words.

  • Enhanced Coherence in Predictions:

    • WWM often results in more coherent predictions, as the model focuses on predicting meaningful whole words rather than fragmented subword tokens. This improves the quality of generated text and the accuracy of downstream tasks like named entity recognition and question answering.

  • More Realistic Training Objective:

    • By masking entire words, the prediction task more closely mirrors real language use, where whole words, not word fragments, are the basic units of meaning. This tends to translate into better generalization and performance on various NLP tasks.

Cons:


  • Increased Training Complexity:

    • WWM makes pre-training harder: the model must recover every piece of a masked word without seeing any of it, instead of filling in a single subword next to its visible neighbors. This can lead to longer training times and the need for more computational resources.

  • Potential Data Sparsity:

    • Masking entire words can lead to data sparsity issues, especially for low-frequency words. The model may encounter fewer instances of these words during training, which can affect its ability to learn robust representations for rare words.

  • Reduced Flexibility with Subwords:

    • Subword tokenization methods like WordPiece let a model learn from partially masked words, for example predicting a suffix or compound element from the visible rest of the word. WWM gives up this finer-grained signal, which can be a disadvantage for languages with complex morphology or for tasks that benefit from subword-level prediction.

  • Masking Efficiency:

    • Because the masking budget is spent in whole-word chunks, fewer distinct words are masked per step and the masked positions within a word are strongly correlated. Each training step therefore provides a less diverse learning signal, which can slow down the model's convergence.


Summary


Whole Word Masking offers significant advantages in terms of contextual understanding, handling rare words, and generating coherent predictions, making it a valuable technique for improving the performance of Masked Language Models. However, it also introduces challenges related to training complexity, data sparsity, and reduced flexibility in tokenization. Careful consideration of these pros and cons is essential when implementing WWM in practical NLP applications.
