Coding Challenge: Byte Pair Encoding (BPE) 🧬
In this challenge, we implement a manual loop for Byte Pair Encoding (BPE). BPE is a subword tokenization method that iteratively merges the most frequent pairs of characters (or tokens) into new, larger tokens.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🏗️ 1. Initialize the Vocabulary
We start with a simple string with many repetitions to see how BPE discovers patterns like "like" and "love".
import numpy as np
# some text with lots of repetitions
text = 'like liker love lovely hug hugs hugging hearts'
chars = list(set(text))
chars.sort() # initial vocab is sorted
for l in chars:
    print(f'"{l}" appears {text.count(l)} times.')
# make a vocabulary
vocab = { word:i for i,word in enumerate(chars) }
print(f"\nInitial Vocabulary: {vocab}")
# the text needs to be a list, not a string
# each element in the list is a token
origtext = list(text)
print(f"\nOriginal tokens: {origtext}")

Output:
" " appears 7 times.
"a" appears 1 times.
"e" appears 5 times.
"g" appears 5 times.
"h" appears 4 times.
"i" appears 3 times.
"k" appears 2 times.
"l" appears 5 times.
"n" appears 1 times.
"o" appears 2 times.
"r" appears 2 times.
"s" appears 2 times.
"t" appears 1 times.
"u" appears 3 times.
"v" appears 2 times.
"y" appears 1 times.

Initial Vocabulary: {' ': 0, 'a': 1, 'e': 2, 'g': 3, 'h': 4, 'i': 5, 'k': 6, 'l': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12, 'u': 13, 'v': 14, 'y': 15}
🛠️ 2. Core Functions
We define three functions to handle the BPE process: counting pairs, updating the vocabulary, and merging tokens in the sequence.
def get_pair_stats(text2pair):
    token_pairs = dict()
    # loop over adjacent token pairs
    for i in range(len(text2pair)-1):
        # create a pair
        pair = text2pair[i] + text2pair[i+1]
        # increase pair frequencies
        if pair in token_pairs:
            token_pairs[pair] += 1
        else:
            token_pairs[pair] = 1
    return token_pairs
def update_vocab(token_pairs,vocab):
    # find the most frequent pair
    idx = np.argmax(list(token_pairs.values()))
    newtok = list(token_pairs.keys())[idx]
    # add it to the vocab with the next available index
    vocab[newtok] = max(vocab.values())+1
    return vocab,newtok
def generate_new_token_seq(prevtext,newtoken):
    # initialize a new text list
    newtext = []
    # loop through the list
    i = 0
    while i<(len(prevtext)-1):
        # test whether this element and the next match the newly created pair
        if (prevtext[i]+prevtext[i+1]) == newtoken:
            newtext.append(newtoken)
            i += 2 # skip the next character
        # not a pair
        else:
            newtext.append(prevtext[i])
            i += 1 # move to the next character
    # append the final element if it wasn't consumed by a merge
    if i < len(prevtext):
        newtext.append(prevtext[i])
    return newtext

🔄 3. The BPE Loop
Now we run the main loop to expand the vocabulary from 16 initial characters to 25 tokens.
# re-initialize the vocab
vocab = { word:i for i,word in enumerate(chars) }
# how many tokens in the vocab?
vocab_size = 25
# make a copy of the original text
updated_text = origtext.copy()
while len(vocab)<vocab_size:
    # get pair statistics
    pairs = get_pair_stats(updated_text)
    # update the dictionary
    vocab,newtoken = update_vocab(pairs,vocab)
    # get a new list of tokens
    updated_text = generate_new_token_seq(updated_text,newtoken)
    print(f'Vocab has {len(vocab)} tokens after adding "{newtoken}"')

Output:
Vocab has 17 tokens after adding " h"
Vocab has 18 tokens after adding " l"
Vocab has 19 tokens after adding " hu"
Vocab has 20 tokens after adding " hug"
Vocab has 21 tokens after adding "ik"
Vocab has 22 tokens after adding "ike"
Vocab has 23 tokens after adding " lo"
Vocab has 24 tokens after adding " lov"
Vocab has 25 tokens after adding " love"
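Once trained, the merge list can tokenize text the loop never saw. The standalone sketch below replays the merges printed above, in learned order, on a new string; `apply_merges` is a hypothetical helper written for this illustration, and the merge list is copied from the loop output:

```python
# Replay learned merges, in order, on unseen text (illustrative sketch).
# The merge list below is copied from the training loop's output.
merges = [' h', ' l', ' hu', ' hug', 'ik', 'ike', ' lo', ' lov', ' love']

def apply_merges(text, merges):
    tokens = list(text)
    # apply each merge everywhere it occurs before moving to the next one
    for m in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] + tokens[i+1] == m:
                out.append(m)
                i += 2  # consume both halves of the pair
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(apply_merges(' likes hug', merges))  # → [' l', 'ike', 's', ' hug']
```

Note that "likes" never appeared in the training text, yet it still compresses via the " l" and "ike" merges, which is precisely how BPE generalizes to unseen words.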
🏁 4. Final Results
Finally, let's look at the compressed token sequence and the final vocabulary.
# final tokenized text
print(f"Final sequence: {updated_text}")
# final vocabulary size and sample
print(f"Final vocab size: {len(vocab)}")
print(f"Final vocab: {vocab}")

Output:
Final sequence: ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y', ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g', ' h', 'e', 'a', 'r', 't', 's']

Final vocab size: 25

Final vocab: {' ': 0, 'a': 1, 'e': 2, 'g': 3, 'h': 4, 'i': 5, 'k': 6, 'l': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12, 'u': 13, 'v': 14, 'y': 15, ' h': 16, ' l': 17, ' hu': 18, ' hug': 19, 'ik': 20, 'ike': 21, ' lo': 22, ' lov': 23, ' love': 24}
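A useful sanity check: BPE only merges tokens, it never drops characters, so decoding is pure concatenation and the compression is lossless. A quick sketch, with the final sequence copied from the output above:

```python
# Concatenating the final tokens must reproduce the training string exactly.
text = 'like liker love lovely hug hugs hugging hearts'
final_seq = ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y',
             ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g',
             ' h', 'e', 'a', 'r', 't', 's']

assert ''.join(final_seq) == text
print(len(final_seq), 'tokens vs.', len(text), 'characters')  # → 23 tokens vs. 46 characters
```

The same 46-character string is now carried by 23 tokens, roughly a 2x compression from just nine merges.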
💡 Summary
By the end of this challenge, we've seen how BPE merged characters into meaningful subwords like love, hug, and ike. This same mechanism, typically applied at the byte level, is what enables Large Language Models to handle a vast range of languages and technical jargon with a relatively small, fixed-size vocabulary.