
SpanBERT: Span-Based Masking

Introduction to SpanBERT


SpanBERT is an extension of the BERT (Bidirectional Encoder Representations from Transformers) model, designed to better represent and predict spans of text. Developed by researchers at the University of Washington, Princeton University, the Allen Institute for AI, and Facebook AI Research, SpanBERT aims to improve performance on span-based tasks such as question answering and coreference resolution.

The key innovation of SpanBERT lies in its pre-training objective, which focuses on predicting entire spans of text rather than individual tokens. This approach allows the model to capture broader context and relationships between words, leading to improved performance on various natural language processing tasks.

Working Process of SpanBERT


SpanBERT's working process can be broken down into several key components:


  1. Span Selection

    1. Instead of masking individual tokens like BERT, SpanBERT masks contiguous spans of text.

    2. The process begins by selecting which spans to mask: span lengths are sampled from a geometric distribution (with p = 0.2, clipped at a maximum of 10 tokens, giving a mean span length of about 3.8 tokens).

    3. This distribution favors shorter spans while still including some longer ones, and spans always begin and end at whole-word boundaries.

    4. The total number of masked tokens is kept at around 15% of the input, similar to BERT.

  2. Span Masking

    1. Once spans are selected, they are masked in a way similar to BERT:

      1. 80% of the time, the entire span is replaced with [MASK] tokens.

      2. 10% of the time, the span is replaced with random tokens.

      3. 10% of the time, the original tokens are left unchanged.

    2. This masking strategy encourages the model to learn robust representations that can handle noise and perturbations in the input (a short code sketch of the span selection and masking procedure follows this list).

  3. Span Boundary Objective (SBO)

    1. SpanBERT introduces a new training objective called the Span Boundary Objective (SBO).

    2. For each masked span, the model is trained to predict the entire content of the span using only the representations of the tokens at the span's boundary (the tokens immediately before and after the masked span).

    3. This objective encourages the model to store span-level information at the boundary tokens, allowing for better representation of content within spans.

  4. Span-Level Predictions

    1. During the prediction phase, SpanBERT uses the representations of the two boundary tokens to predict each token in the masked span.

    2. Each token in the span is predicted independently; predictions are not conditioned on the other tokens in the span.

    3. For every target token, the model combines the boundary representations with a position embedding that encodes the token's position relative to the left boundary of the span.

  5. Single Sequence Training

    1. Unlike BERT, SpanBERT omits the next sentence prediction (NSP) task and is trained on single contiguous segments of up to 512 tokens.

    2. This approach allows the model to learn longer-range dependencies within a document.

  6. Position Embeddings

    1. SpanBERT uses absolute position embeddings, similar to BERT.

    2. However, due to the single sequence training, these embeddings more effectively capture position information within longer contexts.

  7. Fine-Tuning Process

    1. After pre-training, SpanBERT can be fine-tuned for specific downstream tasks, similar to BERT.

    2. The fine-tuning process adapts the pre-trained model to specific span-based tasks like question answering or coreference resolution.

    3. During fine-tuning, the model's parameters are updated using task-specific training data and loss functions.
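
To make the span selection and masking steps concrete, here is a minimal Python sketch. It is a simplified approximation of the procedure described above, not the authors' pre-training code: the clipped geometric distribution (p = 0.2, maximum span length 10) and the roughly 15% masking budget follow the SpanBERT paper, but it operates on whole words rather than WordPiece subwords, and helper names such as mask_spans are invented for this example.

```python
import random
import numpy as np

MASK_TOKEN = "[MASK]"

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length from a geometric distribution clipped at max_len."""
    length = int(np.random.geometric(p))
    return min(length, max_len)

def mask_spans(tokens, vocab, mask_budget=0.15):
    """Select contiguous spans until ~15% of tokens are chosen, then apply
    BERT-style 80/10/10 replacement at the span level."""
    tokens = list(tokens)
    num_to_mask = max(1, int(round(len(tokens) * mask_budget)))
    target_positions = set()  # positions the model will be asked to predict

    while len(target_positions) < num_to_mask:
        span_len = sample_span_length()
        start = random.randrange(0, len(tokens))
        span = [i for i in range(start, min(start + span_len, len(tokens)))
                if i not in target_positions]
        if not span:
            continue
        # The whole span gets the same treatment: 80% [MASK], 10% random, 10% unchanged.
        roll = random.random()
        for i in span:
            if roll < 0.8:
                tokens[i] = MASK_TOKEN
            elif roll < 0.9:
                tokens[i] = random.choice(vocab)
            # else: leave the original token in place (it is still a prediction target)
        target_positions.update(span)

    return tokens, sorted(target_positions)

# Example usage on the sentence used later in this post.
sentence = "The quick brown fox jumps over the lazy dog in the park .".split()
masked, positions = mask_spans(sentence, vocab=sentence)
print(masked, positions)
```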

Workflow Summary


  1. Contiguous spans of tokens are sampled from the input text.

  2. Span lengths are drawn from a geometric distribution until roughly 15% of the tokens are selected for masking.

  3. Selected spans are masked (replaced with [MASK] tokens, random tokens, or left unchanged, with one choice applied to the whole span).

  4. The model is trained to predict the content of masked spans with the regular masked language modeling objective plus the Span Boundary Objective, which uses only the boundary tokens.

  5. Predictions are made for each token in a masked span independently, using the boundary representations and the token's relative position.

  6. The model learns to capture span-level information and longer-range dependencies.

  7. After pre-training, the model can be fine-tuned for specific span-based tasks (a short loading example follows this list).


This approach allows SpanBERT to develop a stronger understanding of relationships between words within spans and across longer distances in the text, leading to improved performance on various natural language processing tasks, especially those involving spans of text.
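
Because SpanBERT keeps BERT's architecture, published checkpoints can be loaded with the standard Hugging Face transformers classes. The snippet below is a hedged sketch rather than an official recipe: it assumes the SpanBERT/spanbert-base-cased checkpoint on the Hugging Face Hub and reuses BERT's cased WordPiece tokenizer (the vocabulary SpanBERT was trained with); swap in your own checkpoint and task head as needed.

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# SpanBERT keeps BERT's architecture, so generic BERT/Auto classes can load it.
# Checkpoint name is an assumption: the authors' weights as mirrored on the Hub.
checkpoint = "SpanBERT/spanbert-base-cased"

# SpanBERT was trained with BERT's cased WordPiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The question-answering head is randomly initialized here and would be
# trained during fine-tuning on a dataset such as SQuAD.
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What jumps over the lazy dog?"
context = "The quick brown fox jumps over the lazy dog in the park."
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)  # start/end logits over the context tokens
```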


SpanBERT Working Process: An Example


Let's walk through the SpanBERT process using the following example sentence:

"The quick brown fox jumps over the lazy dog in the park."


1. Span Selection and Masking


First, SpanBERT selects spans of text to mask. Let's say it chooses two spans:

  • "brown fox"

  • "the lazy"


Our sentence now looks like this (with masked spans in brackets):


"The quick [brown fox] jumps over [the lazy] dog in the park."


2. Applying Mask Tokens


Now, SpanBERT applies its masking strategy. For this example, let's say:

  • "brown fox" is replaced with [MASK] tokens

  • "the lazy" is left unchanged (10% probability)


Our sentence becomes:


"The quick [MASK] [MASK] jumps over [the lazy] dog in the park."


3. Span Boundary Objective (SBO)


For the masked span "brown fox", SpanBERT will try to predict the content using only the boundary tokens:

  • Left boundary: "quick"

  • Right boundary: "jumps"


The model uses the representations of "quick" and "jumps" to predict "brown fox".
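
To make the SBO concrete, here is a minimal PyTorch sketch of the prediction function. In the paper this is a two-layer feed-forward network with GELU activations and layer normalization that takes the two boundary encodings and a relative position embedding; the module and variable names here (SpanBoundaryHead, max_span_len, etc.) are illustrative rather than the authors' code, and the final vocabulary projection stands in for the model's MLM output layer.

```python
import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """Predicts each token inside a masked span from the two boundary encodings
    and a position embedding for the token's offset within the span."""

    def __init__(self, hidden_size, vocab_size, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)
        # Two-layer feed-forward network over the concatenated inputs,
        # following the description in the SpanBERT paper.
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, left_boundary, right_boundary, position_in_span):
        # left_boundary, right_boundary: (batch, hidden); position_in_span: (batch,)
        p = self.pos_emb(position_in_span)
        h = self.mlp(torch.cat([left_boundary, right_boundary, p], dim=-1))
        return self.decoder(h)  # logits over the vocabulary

# E.g. for "brown": the boundaries are the encodings of "quick" and "jumps",
# and position_in_span is 0 (the first position inside the masked span).
```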


4. Span-Level Predictions


SpanBERT predicts each masked token independently:

  • It predicts "brown" using the boundary representations of "quick" and "jumps" together with a position embedding marking the first position inside the span.

  • It predicts "fox" using the same boundary representations and a position embedding marking the second position; the prediction for "brown" is not reused.


5. Training Process


During training, SpanBERT learns to:

  • Encode span-level information into the boundary token representations.

  • Use this information to accurately predict the masked spans.

  • Minimize, for each masked token, the sum of the regular masked language modeling (MLM) loss and the SBO loss (a sketch of this combined loss follows below).
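
As described above, the loss for each token in a masked span is the sum of the regular MLM cross-entropy and the SBO cross-entropy. Below is a minimal sketch of that combination; the logits would come from the MLM head and from a span-boundary head like the one sketched earlier, and the function name is hypothetical.

```python
import torch.nn.functional as F

def span_token_loss(mlm_logits, sbo_logits, target_ids):
    """Total loss for masked positions: MLM cross-entropy plus SBO cross-entropy.
    Both logits tensors have shape (num_masked_tokens, vocab_size)."""
    mlm_loss = F.cross_entropy(mlm_logits, target_ids)
    sbo_loss = F.cross_entropy(sbo_logits, target_ids)
    return mlm_loss + sbo_loss
```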


6. Handling the Unchanged Span


For the span "the lazy" that was left unchanged, SpanBERT still trains on this span:

  • It uses the boundary tokens "over" and "dog".

  • The model learns to predict "the lazy" correctly, even though it's visible in the input.

  • This helps the model learn to use visible context effectively.



SpanBERT: Pros, Cons, and Key Features


Key Features


  1. Span-based Masking: Instead of masking individual tokens, SpanBERT masks contiguous spans of text.

  2. Span Boundary Objective (SBO): Trains the model to predict the entire content of a masked span using only the representations of the tokens at the span's boundary.

  3. Single-Sequence Training: Trains on single contiguous segments of text, removing the next sentence prediction (NSP) task used in BERT.

  4. Unchanged Transformer Architecture: Keeps BERT's model architecture and standard self-attention; the changes are confined to the pre-training objectives and data preparation, so existing BERT tooling can be reused.

  5. Geometric Distribution for Span Length: Uses a geometric distribution to sample span lengths, favoring shorter spans while still including some longer ones.

Pros


  1. Improved Performance on Span-based Tasks: Excels in tasks like question answering and coreference resolution, which often involve reasoning about spans of text.

  2. Better Long-range Dependencies: The span-based approach helps the model capture relationships between words across longer distances in the text.

  3. Enhanced Contextual Understanding: By predicting entire spans, the model develops a more nuanced understanding of context and word relationships.

  4. Efficiency in Fine-tuning: Often requires less fine-tuning data to achieve good performance on downstream tasks.

  5. Versatility: While specialized for span-based tasks, it maintains strong performance across a wide range of NLP tasks.

  6. Improved Robustness: The span-based approach makes the model more robust to noise and variations in input text.

  7. Better Handling of Entity-level Information: The span-based training is particularly beneficial for tasks involving named entities or multi-word expressions.


Cons


  1. Increased Pre-training Complexity: The span-based approach and additional objective (SBO) make pre-training more complex compared to standard BERT.

  2. Potential Overfitting to Span Structures: There's a risk of the model becoming overly reliant on span-based patterns, which might not be optimal for all types of NLP tasks.

  3. Computationally Intensive Pre-training: The additional SBO prediction head adds computation during pre-training, although inference cost is the same as for a standard BERT model of the same size.

  4. Less Effective for Token-level Tasks: While excelling at span-level tasks, it might not provide significant improvements for tasks that primarily operate at the token level.

  5. Extra Pre-training Parameters: The SBO prediction head adds parameters during pre-training, although the model used for fine-tuning and inference is the same size as the corresponding BERT model.

  6. Limited Multilingual Capabilities: The original SpanBERT was primarily developed and tested on English, and its effectiveness across multiple languages isn't as well-established as multilingual BERT variants.

  7. Potential Data Bias: The span-based approach might introduce biases based on how spans are defined and selected during pre-training.

 

