
Coding Challenge: Byte Pair Encoding (BPE) 🧬

In this challenge, we implement a manual loop for Byte Pair Encoding (BPE). BPE is a subword tokenization method that iteratively merges the most frequent pairs of characters (or tokens) into new, larger tokens.
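To make the merge rule concrete, here is a minimal standalone sketch of one BPE step on a toy string (the string `aaabdaaabac` and the names below are illustrative, not part of the challenge code):

```python
from collections import Counter

# start from individual characters as tokens
tokens = list('aaabdaaabac')

# count every adjacent pair of tokens
pairs = Counter(tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1))

# the most frequent pair becomes the next merged token
best = max(pairs, key=pairs.get)
print(best, pairs[best])  # 'aa' appears 4 times, so 'aa' joins the vocabulary
```

Repeating this step — count pairs, merge the winner, re-tokenize — is the whole training algorithm.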

🌍 References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🏗️ 1. Initialize the Vocabulary

We start with a simple string with many repetitions to see how BPE discovers patterns like "like" and "love".

import numpy as np
 
# some text with lots of repetitions
text = 'like liker love lovely hug hugs hugging hearts'
 
chars = list(set(text))
chars.sort() # initial vocab is sorted
 
for l in chars:
  print(f'"{l}" appears {text.count(l)} times.')
 
# make a vocabulary
vocab = { word:i for i,word in enumerate(chars) }
print(f"\nInitial Vocabulary: {vocab}")
 
# the text needs to be a list, not a string
# each element in the list is a token
origtext = list(text)
print(f"\nOriginal tokens: {origtext}")

Output:

" " appears 7 times.
"a" appears 1 times.
"e" appears 5 times.
"g" appears 5 times.
"h" appears 4 times.
"i" appears 3 times.
"k" appears 2 times.
"l" appears 5 times.
"n" appears 1 times.
"o" appears 2 times.
"r" appears 2 times.
"s" appears 2 times.
"t" appears 1 times.
"u" appears 3 times.
"v" appears 2 times.
"y" appears 1 times.

Initial Vocabulary: {' ': 0, 'a': 1, 'e': 2, 'g': 3, 'h': 4, 'i': 5, 'k': 6, 'l': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12, 'u': 13, 'v': 14, 'y': 15}

🛠️ 2. Core Functions

We define three functions to handle the BPE process: counting pairs, updating the vocabulary, and merging tokens in the sequence.

def get_pair_stats(text2pair):
  token_pairs = dict()
 
  # loop over tokens
  for i in range(len(text2pair)-1):
 
    # create a pair
    pair = text2pair[i] + text2pair[i+1]
 
    # increase pair frequencies
    if pair in token_pairs:
      token_pairs[pair] += 1
    else:
      token_pairs[pair] = 1
 
  return token_pairs
 
def update_vocab(token_pairs,vocab):
 
  # find the most frequent pair
  idx = np.argmax(list(token_pairs.values()))
  newtok = list(token_pairs.keys())[idx]
 
  # update the vocab
  vocab[newtok] = max(vocab.values())+1
  return vocab,newtok
 
def generate_new_token_seq(prevtext,newtoken):
 
  # initialize a new text list
  newtext = []
 
  # loop through the list
  i = 0
  while i<(len(prevtext)-1):
 
    # test whether the pair of this and the following elements match the newly-created pair
    if (prevtext[i]+prevtext[i+1]) == newtoken:
      newtext.append(newtoken)
      i += 2 # skip the next character
 
    # not a pair
    else:
      newtext.append(prevtext[i])
      i += 1 # move to the next character
 
  # append the final token if it wasn't consumed by a merge
  if i < len(prevtext):
    newtext.append(prevtext[i])
 
  return newtext
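As a quick sanity check of the logic above, the same two steps can be reproduced with compact standalone helpers (`pair_counts` and `merge` are illustrative names, not part of the challenge code):

```python
from collections import Counter

def pair_counts(tokens):
    # frequency of each adjacent token pair
    return Counter(tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1))

def merge(tokens, pair):
    # replace every occurrence of the winning pair with a single token
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] + tokens[i + 1] == pair:
            out.append(pair)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(pair_counts(list('abab')))  # Counter({'ab': 2, 'ba': 1})
print(merge(list('abab'), 'ab'))  # ['ab', 'ab']
```

Because this `merge` checks the bound inside the loop, it never drops the last token; `generate_new_token_seq` achieves the same result with its trailing `if i < len(prevtext)` fix-up after the loop.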

🔄 3. The BPE Loop

Now we run the main loop to expand the vocabulary from 16 initial characters to 25 tokens.

# re-initialize the vocab
vocab = { word:i for i,word in enumerate(chars) }
 
# how many tokens in the vocab?
vocab_size = 25
 
# make a copy of the original text
updated_text = origtext.copy()
 
while len(vocab)<vocab_size:
 
  # get pair statistics
  pairs = get_pair_stats(updated_text)
 
  # update the dictionary
  vocab,newtoken = update_vocab(pairs,vocab)
 
  # get a new list of tokens
  updated_text = generate_new_token_seq(updated_text,newtoken)
  print(f'Vocab has {len(vocab)} tokens after adding "{newtoken}"')

Output:

Vocab has 17 tokens after adding " h"
Vocab has 18 tokens after adding " l"
Vocab has 19 tokens after adding " hu"
Vocab has 20 tokens after adding " hug"
Vocab has 21 tokens after adding "ik"
Vocab has 22 tokens after adding "ike"
Vocab has 23 tokens after adding " lo"
Vocab has 24 tokens after adding " lov"
Vocab has 25 tokens after adding " love"

🏁 4. Final Results

Finally, let's look at the compressed token sequence and the final vocabulary.

# final tokenized text
print(f"Final sequence: {updated_text}")
 
# final vocabulary size and sample
print(f"Final vocab size: {len(vocab)}")
print(f"Final vocab: {vocab}")

Output:

Final sequence: ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y', ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g', ' h', 'e', 'a', 'r', 't', 's']

Final vocab size: 25
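One defining property of BPE is that merging is lossless: concatenating the final tokens reproduces the input exactly. A standalone check, with the final sequence copied from the output above:

```python
text = 'like liker love lovely hug hugs hugging hearts'

# final token sequence from the BPE run above
final = ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y',
         ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g',
         ' h', 'e', 'a', 'r', 't', 's']

# the merges compress the sequence but lose no information
assert ''.join(final) == text
print(f'{len(text)} characters -> {len(final)} tokens')  # 46 characters -> 23 tokens
```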

💡 Summary

By the end of this challenge, we've seen how BPE merged individual characters into meaningful subwords like love, hug, and ike, compressing the original 46-character sequence into 23 tokens. This same mechanism, typically run over bytes rather than characters in practice, is what lets Large Language Models handle a vast range of languages and technical jargon with a relatively small, fixed-size vocabulary.
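Once trained, the merge list (in creation order) is what a tokenizer actually ships: new text is encoded by replaying the merges. A minimal sketch using the merge order printed in section 3; `encode` is an illustrative helper, and production byte-level BPE tokenizers store merges as ranked pairs rather than concatenated strings:

```python
def encode(text, merges):
    # start from characters and replay each learned merge in order
    tokens = list(text)
    for pair in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] + tokens[i + 1] == pair:
                out.append(pair)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# merges learned in section 3, in the order they were added
merges = [' h', ' l', ' hu', ' hug', 'ik', 'ike', ' lo', ' lov', ' love']
print(encode(' hugs like', merges))  # [' hug', 's', ' l', 'ike']
```

A merge learned later can never create a pair learned earlier (earlier pairs were built before the later token existed), so a single ordered pass per merge is sufficient.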

© 2026 Driptanil Datta. All rights reserved.

Software Developer & Engineer


Built with Love ❤️ | Last updated: Mar 16 2026