
XLNet: Permutation Language Modeling

Updated: Jun 23

Overview

Permutation Language Modeling (PLM) is the core concept behind XLNet, a powerful language model introduced as an alternative to BERT. PLM aims to combine the strengths of autoregressive (AR) and autoencoding (AE) language models while addressing their limitations.


Background


To understand PLM, it's helpful to first recall two types of language models (their training objectives are written out after the list below):


  1. Autoregressive (AR) models (e.g., GPT):

    1. Predict the next token based on all previous tokens.

    2. Preserve the original order of the sequence.

    3. Cannot capture bidirectional context.

  2. Autoencoding (AE) models (e.g., BERT):

    1. Predict masked tokens using bidirectional context.

    2. Can capture bidirectional context.

    3. Suffer from pretrain-finetune discrepancy due to [MASK] tokens.
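
For reference (roughly following the comparison made in the XLNet paper), the two objectives for a sequence x = (x_1, ..., x_T) can be written as:

\[ \text{AR:}\quad \max_{\theta}\; \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid \mathbf{x}_{<t}\right) \]

\[ \text{AE (BERT):}\quad \max_{\theta}\; \sum_{t=1}^{T} m_t \log p_{\theta}\left(x_t \mid \hat{\mathbf{x}}\right) \]

where \(\hat{\mathbf{x}}\) is the input with some tokens replaced by [MASK] and \(m_t = 1\) only for masked positions. The AE objective is only an approximate factorization, because the masked tokens are predicted independently of one another given the corrupted input; this is the independence assumption discussed in the pros and cons below.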


Permutation Language Modeling: Core Idea


PLM predicts each token using only the tokens that come before it in a given permutation of the sequence order, and trains over many different permutations, so that in expectation every token learns to condition on every other token (the training objective is written out after the list). This approach:


  1. Captures bidirectional context.

  2. Avoids using [MASK] tokens.

  3. Maintains the autoregressive property.
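
Formally, let \(\mathcal{Z}_T\) be the set of all permutations of the index sequence [1, ..., T], and let \(z_t\) and \(\mathbf{z}_{<t}\) denote the t-th element and the first t-1 elements of a permutation \(\mathbf{z}\). The permutation language modeling objective from the XLNet paper is:

\[ \max_{\theta}\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right] \]

Only the factorization order is permuted; the tokens themselves keep their original positional encodings, so the model always sees sentences in their natural order.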


XLNet: The Paradox in Permutation Language Modeling


In permutation language modeling, we face a paradox when trying to predict tokens:


  1. We need to know the content of a token to calculate attention for other tokens.

  2. We can't know the content of the token we're trying to predict.


This creates a conflict between maintaining the autoregressive property (predicting based only on previous tokens) and allowing the model to know which position it's predicting.


Example to Illustrate the Paradox


Let's use the sentence: "The cat sat"

Suppose we're using the permutation [2, 3, 1], which corresponds to ["cat", "sat", "The"].


  1. Step 1: Predicting "cat" (easy case)

    1. We're at the first position in our permutation.

    2. No paradox here: we simply predict "cat" without any context.

  2. Step 2: Predicting "sat" (the paradox begins)

    1. We're at the second position in our permutation.

    2. We want to use "cat" as context to predict "sat".

    3. Paradox:

      1. We need to tell the model that it is predicting original position 2 (the slot where "sat" belongs).

      2. But in a standard Transformer, a position's representation fuses its position embedding with its content, so any representation tied to that position would also carry the token "sat" itself, violating the autoregressive property.

  3. Step 3: Predicting "The" (the paradox in full effect)

    1. We're at the third position in our permutation, but it's actually the first word "The".

    2. We want to use "cat" and "sat" as context to predict "The".

    3. Paradox in full:

      1. We need to know we're predicting the first position (originally "The").

      2. We need to use "cat" and "sat" as context.

      3. But a standard representation for position 1 would be built from the embedding of "The", so the model would effectively see the answer.

      4. If we drop position 1 entirely, the model has no idea which slot of the original sentence it is filling, so it would predict the same distribution no matter which token is the target.


The Problem This Creates


Without resolving this paradox:


  1. If we feed in the full representation at the target position, the model cheats, because that representation already encodes the word it is supposed to predict.

  2. If we leave the target position out entirely, the model doesn't know which position in the original sequence it is predicting, losing crucial information; this dilemma is formalized just below.
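
Concretely, in the XLNet paper's notation, the standard Softmax parameterization

\[ p_{\theta}\left(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) = \frac{\exp\left(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}{\sum_{x'} \exp\left(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)} \]

does not depend on which position \(z_t\) is being predicted, so it would assign the same distribution to every remaining target. XLNet therefore re-parameterizes the prediction with a target-aware representation \(g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\) that takes the target position \(z_t\) as an extra input; the two-stream attention described next is how this representation is computed without leaking the content \(x_{z_t}\).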


How Two-Stream Attention Solves This


XLNet's two-stream attention mechanism resolves this paradox:


  1. Content Stream:

    1. Represents the actual content at each position.

    2. For "The": Attends to "cat", "sat", and "The".

    3. Provides full context for other predictions.

  2. Query Stream:

    1. Used for prediction, doesn't see the current token.

    2. For "The": Attends only to "cat" and "sat".

    3. Knows it's predicting the first position but doesn't see "The".

  3. When predicting "The":

    1. Use the query stream, which knows it's predicting position 1 but only has information about "cat" and "sat".

    2. Use the content stream representations of "cat" and "sat" for context.

    3. This allows prediction of "The" using relevant context and position information, without cheating by seeing "The" itself.

  4. This mechanism, whose layer-by-layer update rules are sketched after this list, allows XLNet to:

    1. Maintain the autoregressive property.

    2. Know which position it's predicting.

    3. Use bidirectional context through different permutations.
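
In the notation of the XLNet paper, the two streams are updated layer by layer as follows, for layer m and factorization step t:

\[ h_{z_t}^{(m)} \leftarrow \text{Attention}\left(\mathbf{Q} = h_{z_t}^{(m-1)},\; \mathbf{KV} = \mathbf{h}_{\mathbf{z}_{\le t}}^{(m-1)};\, \theta\right) \quad \text{(content stream: can see } x_{z_t}\text{)} \]

\[ g_{z_t}^{(m)} \leftarrow \text{Attention}\left(\mathbf{Q} = g_{z_t}^{(m-1)},\; \mathbf{KV} = \mathbf{h}_{\mathbf{z}_{< t}}^{(m-1)};\, \theta\right) \quad \text{(query stream: cannot see } x_{z_t}\text{)} \]

The content stream is initialized with the token embedding e(x_i), the query stream with a shared learned vector w, and both streams use the same parameters \(\theta\). The top-layer query representation \(g_{z_t}^{(M)}\) is what feeds the softmax that predicts \(x_{z_t}\).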


By separating "what we're predicting" (query stream) from "what we know" (content stream), XLNet resolves the paradox and enables effective permutation language modeling.
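
As a toy illustration of these visibility rules (a sketch in plain Python/NumPy using 0-based positions, not code from the XLNet implementation; the variable names are made up), here is how the two attention masks look for the permutation used above:

import numpy as np

# Toy sketch: build content-stream and query-stream attention masks for one permutation.
tokens = ["The", "cat", "sat"]      # original order: positions 0, 1, 2
perm = [1, 2, 0]                    # factorization order: "cat", "sat", "The"
n = len(tokens)

# rank[i] = where original position i appears in the factorization order
rank = {pos: step for step, pos in enumerate(perm)}

content_mask = np.zeros((n, n), dtype=bool)  # content_mask[i, j]: may position i attend to j?
query_mask = np.zeros((n, n), dtype=bool)

for i in range(n):
    for j in range(n):
        # Content stream: a position sees itself and everything earlier in the factorization order.
        content_mask[i, j] = rank[j] <= rank[i]
        # Query stream: a position sees only strictly earlier positions, never its own content.
        query_mask[i, j] = rank[j] < rank[i]

for i, tok in enumerate(tokens):
    seen_by_content = [tokens[j] for j in range(n) if content_mask[i, j]]
    seen_by_query = [tokens[j] for j in range(n) if query_mask[i, j]]
    print(f"{tok}: content stream sees {seen_by_content}, query stream sees {seen_by_query}")

Running this prints, for "The", a content stream that sees ["The", "cat", "sat"] and a query stream that sees only ["cat", "sat"], exactly the split described above. In the real model, these boolean masks are applied inside the self-attention layers of the two streams, and a fresh permutation is sampled for each training sequence.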


XLNet: Detailed Working of Permutation Language Modeling, with an Example


Let's break down the working of Permutation Language Modeling (PLM) in XLNet step by step, and then illustrate it with an example.


Detailed Working Process


  1. Input Sequence:

    1. Start with an input sequence of tokens.

  2. Permutations:

    1. Consider all possible permutations of the sequence order.

    2. In practice, for efficiency, a subset of permutations is sampled randomly.

  3. Factorization Order:

    1. For each permutation, create a factorization order.

    2. This order determines which tokens are available for predicting each target token.

  4. Target Tokens:

    1. Select a subset of tokens (usually the last few) in the factorization order as target tokens to be predicted.

  5. Two-Stream Attention:

    1. Content Stream: Represents the content at each position, can attend to itself and previous tokens.

    2. Query Stream: Used for predicting the token at each position, can only attend to previous tokens.

  6. Attention Masking:

    1. Use special attention masks to ensure each token only attends to the appropriate previous tokens in the current permutation.

  7. Token Prediction:

    1. For each target token, predict it using only the tokens before it in the factorization order.

    2. Use the query stream for this prediction.

  8. Loss Calculation:

    1. Calculate the negative log-likelihood of the correct tokens.

    2. Average the loss over all permutations and target tokens.

  9. Optimization:

    1. Update the model parameters to minimize this loss.
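
To make steps 2-8 concrete, here is a minimal end-to-end sketch (the model is a uniform-probability stand-in and every name is hypothetical, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

tokens = ["The", "cat", "sat"]
vocab = {"The": 0, "cat": 1, "sat": 2}
n, k = len(tokens), 2                    # predict the last k tokens of each factorization order

def dummy_model(context_ids, target_pos):
    # Stand-in for the query stream: a real model would condition on the context tokens
    # and the target position; here we simply return a uniform distribution over the vocab.
    return np.full(len(vocab), 1.0 / len(vocab))

total_nll, num_targets = 0.0, 0
for _ in range(4):                       # sample a few permutations instead of all 3! = 6
    order = rng.permutation(n)           # a factorization order over the original positions
    for step in range(n - k, n):         # only the last k positions are prediction targets
        target_pos = order[step]
        context_ids = [vocab[tokens[p]] for p in order[:step]]
        probs = dummy_model(context_ids, target_pos)
        total_nll += -np.log(probs[vocab[tokens[target_pos]]])
        num_targets += 1

print("average NLL over sampled permutations:", total_nll / num_targets)

In real training, dummy_model is replaced by the query-stream output of the network, the attention masks from the earlier sketch enforce who may see whom, and the averaged negative log-likelihood is backpropagated to update the parameters (step 9).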


Example


Let's walk through an example with a simple sentence: "The cat sat"


  1. Input Sequence: ["The", "cat", "sat"]

  2. Possible Permutations (3! = 6 total):

    1. ["The", "cat", "sat"]

    2. ["The", "sat", "cat"]

    3. ["cat", "The", "sat"]

    4. ["cat", "sat", "The"]

    5. ["sat", "The", "cat"]

    6. ["sat", "cat", "The"]

  3. Factorization Orders: Let's consider two of these permutations for our example:

    1. Permutation 1: ["The", "cat", "sat"]

    2. Permutation 4: ["cat", "sat", "The"]

  4. Target Tokens: Let's say we're predicting the last two tokens in each permutation.

  5. Attention Mechanism:

    1. For Permutation 1: ["The", "cat", "sat"]

      1. Predicting "cat":

        1. Content stream can attend to: ["The", "cat"]

        2. Query stream can attend to: ["The"]

      2. Predicting "sat":

        1. Content stream can attend to: ["The", "cat", "sat"]

        2. Query stream can attend to: ["The", "cat"]

    2. For Permutation 4: ["cat", "sat", "The"]

      1. Predicting "sat":

        1. Content stream can attend to: ["cat", "sat"]

        2. Query stream can attend to: ["cat"]

      2. Predicting "The":

        1. Content stream can attend to: ["cat", "sat", "The"]

        2. Query stream can attend to: ["cat", "sat"]

  6. Token Prediction:

    1. For Permutation 1:

      1. Predict "cat" given "The"

      2. Predict "sat" given "The cat"

    2. For Permutation 4:

      1. Predict "sat" given "cat"

      2. Predict "The" given "cat sat"

  7. Loss Calculation: Calculate the negative log-likelihood of the correct predictions for each target token in each permutation, then average these losses.

  8. Training: Update the model parameters to minimize this average loss across all permutations and all sequences in the training data.


By training on different permutations, XLNet learns to use whatever context is available to make predictions, regardless of the order. This allows it to capture bidirectional context without relying on explicit mask tokens like BERT does.



Permutation Language Modeling: Pros and Cons


Pros


  1. Bidirectional Context 

    1. PLM can capture dependencies from both directions without using mask tokens.

    2. This leads to a richer understanding of context compared to unidirectional models.

  2. No Pretrain-Finetune Discrepancy 

    1. Unlike BERT, which uses [MASK] tokens during pretraining but not during finetuning, PLM maintains consistency between these phases.

    2. This can lead to better performance on downstream tasks.

  3. Captures Long-Range Dependencies 

    1. By considering various permutations, PLM can model complex, long-range dependencies in the data.

    2. This is particularly beneficial for tasks requiring understanding of broader context.

  4. Retains Autoregressive Property 

    1. Unlike BERT, PLM retains the autoregressive property, making it suitable for generation tasks.

    2. This makes XLNet more versatile, capable of both understanding and generation tasks.

  5. Overcomes Independence Assumption 

    1. BERT predicts masked tokens independently of each other given the unmasked context, an assumption that often does not hold.

    2. PLM avoids this assumption, potentially leading to more accurate predictions.

  6. Enhanced Performance 

    1. At the time of its release, XLNet reported better results than BERT and other contemporaneous models on a range of NLP benchmarks.

  7. Incorporation of Transformer-XL 

    1. XLNet incorporates Transformer-XL's segment-level recurrence and relative positional encodings, which help it handle longer sequences effectively.


Cons


  1. Computational Complexity 

    1. Even though only a sample of permutations (and only some target tokens per sequence) is used in practice, permutation-based training is computationally expensive.

    2. This leads to longer training times and higher resource requirements compared to models like BERT.

  2. Implementation Complexity 

    1. The two-stream attention mechanism and permutation-based training make XLNet more complex to implement and debug.

  3. Potential Overfitting 

    1. With increased model complexity comes the risk of overfitting, especially on smaller datasets.

  4. Difficulty in Interpretation 

    1. The permutation-based approach can make it more challenging to interpret the model's decisions compared to simpler models.

  5. Training Instability 

    1. The complex training objective can sometimes lead to instability during the training process.

  6. Resource Intensive

    1. Due to its complexity, XLNet often requires more memory and computational resources than BERT for the same model size.

  7. Less Intuitive

    1. The concept of permutation-based training is less intuitive than masked language modeling, potentially making it harder for practitioners to understand and adapt.

  8. Limited Improvement on Some Tasks 

    1. While XLNet shows significant improvements on many tasks, the gains are less pronounced on some simpler NLP tasks.

  9. Harder to Parallelize 

    1. The autoregressive formulation makes some computations, notably generation, which proceeds token by token, harder to parallelize than BERT's single-pass masked predictions.

In conclusion, while Permutation Language Modeling offers several significant advantages in terms of modeling power and versatility, it also comes with increased complexity and computational demands. The choice between PLM and other approaches often depends on the specific requirements of the task, available computational resources, and the need for model interpretability.
