5. Byte Pair Encoding (BPE) Concepts 🧬
Byte Pair Encoding (BPE) is a cornerstone of modern NLP. It's a subword tokenization method that allows models to represent rare words by breaking them down into common character sequences.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🏗️ 1. Why BPE?
Traditional tokenization (splitting by whitespace) has two major flaws:
- Vocabulary Size: Every unique word needs its own ID, which leads to massive vocabularies.
- Out-of-Vocabulary (OOV): If the model encounters a word absent from its training data (e.g., "liker"), it has no token ID to assign.
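To make the OOV problem concrete, here is a minimal sketch (the vocabularies are hypothetical, chosen only for illustration): a word-level vocabulary cannot represent "liker" at all, while a subword vocabulary covers it from known pieces.

```python
# Hypothetical word-level vocabulary: every word needs its own ID
word_vocab = {"like": 0, "love": 1, "hug": 2}
print("liker" in word_vocab)  # False -> out-of-vocabulary

# With subword tokens, "liker" decomposes into known pieces
subword_vocab = {"like": 0, "love": 1, "hug": 2, "r": 3}
pieces = ["like", "r"]
print(all(p in subword_vocab for p in pieces))  # True -> representable
```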
BPE solves this by iteratively merging the most frequent character pairs into new tokens.
🛠️ 2. The Step-by-Step Process
Let's walk through a manual iteration using a toy dataset.
```python
# A short string with many repeated substrings
text = 'like liker love lovely hug hugs hugging hearts'

# Initial vocabulary: all unique characters
chars = sorted(set(text))
vocab = {char: i for i, char in enumerate(chars)}

# The working sequence is a list of tokens (initially single characters)
tokens = list(text)
print(f"Initial sequence: {tokens[:10]}...")
```

Finding the Most Frequent Pair
We count every pair of adjacent tokens in our sequence.
```python
# Tally every adjacent pair of tokens
token_pairs = {}
for i in range(len(tokens) - 1):
    pair = tokens[i] + tokens[i + 1]
    token_pairs[pair] = token_pairs.get(pair, 0) + 1

# Find the winner
most_freq_pair = max(token_pairs, key=token_pairs.get)
print(f'Most frequent pair: "{most_freq_pair}" ({token_pairs[most_freq_pair]} times)')
```

Output:

```
Most frequent pair: " h" (4 times)
```
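The same pair count can be written more idiomatically with the standard library's `collections.Counter`; this is a sketch equivalent to the loop above, not part of the original walkthrough:

```python
from collections import Counter

text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)

# zip pairs each token with its successor; Counter tallies the pairs
pair_counts = Counter(zip(tokens, tokens[1:]))
(best_pair, count), = pair_counts.most_common(1)
print(best_pair, count)  # (' ', 'h') 4
```

`most_common(1)` returns the single highest-count pair, matching the `max(..., key=...)` result when the maximum is unique.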
Updating the Vocabulary
Now we add this new "merged" token to our vocabulary.
```python
# Register the merged pair as a new token with the next free ID
vocab[most_freq_pair] = max(vocab.values()) + 1
print(f"New vocabulary size: {len(vocab)}")
```

🔄 3. Merging in the Text
Finally, we replace every instance of the character pair (' ' followed by 'h') with our new single token (' h').
```python
# Rebuild the sequence, replacing each (' ', 'h') pair with the merged token
new_text = []
i = 0
while i < (len(tokens) - 1):
    if (tokens[i] + tokens[i + 1]) == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2  # Skip the next character: it was consumed by the merge
    else:
        new_text.append(tokens[i])
        i += 1

# If the final token was not part of a merge, it still needs to be copied over
if i < len(tokens):
    new_text.append(tokens[i])

print(f"Original length: {len(tokens)}")
print(f"New length: {len(new_text)}")
```

Output:

```
Original length: 46
New length: 42
```
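A useful sanity check on any merge step: joining the merged tokens must reproduce the original string exactly, because merging only changes how the text is segmented, never its content. Here is a minimal, self-contained version of the merge with that check (same toy text and pair as above):

```python
text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)

# Apply one merge of the pair (' ', 'h') found in the walkthrough
merged = []
i = 0
while i < len(tokens):
    if i < len(tokens) - 1 and tokens[i] + tokens[i + 1] == ' h':
        merged.append(' h')
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

# Merging changes segmentation, not content: decoding round-trips
assert ''.join(merged) == text
print(len(merged))  # 42
```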
💡 What's Next?
In the next lesson, we'll automate this process with a loop to build a vocabulary of any size!