Jun 30 · 6 min read · Tokenization
Byte-Level BPE
Byte-Level BPE: Unicode-agnostic tokenization. Handles any character and out-of-vocabulary words. Balances efficiency and representation.

Jun 30 · 4 min read · Tokenization
Subword Regularization with BPE
Stochastic tokenization improving robustness. Applicable in BPE pre-training and fine-tuning. Balances consistency and variability.

Jun 30 · 4 min read · Tokenization
Dynamic BPE
Dynamic BPE: Adaptive tokenization for pre-training and fine-tuning. Balances flexibility and consistency.

Jun 30 · 4 min read · Tokenization
BPE Dropout
BPE Dropout: Stochastic subword segmentation. Applies dropout to merges during tokenization. Improves model robustness and generalization.

Jun 28 · 11 min read · Tokenization
WordPiece Tokenization: A BPE Variant
WordPiece Tokenization: Subword segmentation for NLP. Builds vocab from frequent subwords and handles rare words.

Jun 27 · 7 min read · Tokenization
Byte Pair Encoding: Cracking the Subword Code
Byte Pair Encoding: Efficient subword segmentation. Merges frequent character pairs, handles unseen words, scales to sentences.
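
The core merge step that these BPE posts describe can be sketched in a few lines; this is a minimal illustration with a made-up toy corpus, not code from any of the articles above.

```python
# Sketch of one BPE merge step: count adjacent symbol pairs across the
# corpus, then merge the most frequent pair everywhere it occurs.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a {word-tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus (hypothetical): word -> frequency, words as character tuples.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
```

Repeating these two steps until a target vocabulary size is reached yields the BPE merge table; at tokenization time the learned merges are replayed in order, which is what lets unseen words decompose into known subwords.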