5. Byte Pair Encoding (BPE) Concepts 🧬

Byte Pair Encoding (BPE) is a cornerstone of modern NLP. It's a subword tokenization method that allows models to represent rare words by breaking them down into common character sequences.

🌍 References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🏗️ 1. Why BPE?

Traditional tokenization (splitting by whitespace) has two major flaws:

  1. Vocabulary Size: Every unique word needs its own ID, which leads to massive vocabularies.
  2. Out-of-Vocabulary (OOV): If the model sees a word it wasn't trained on (e.g., "liker"), it fails.

BPE solves this by iteratively merging the most frequent character pairs into new tokens.
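To see why this helps with OOV words, here is a minimal sketch of how a word unseen at training time decomposes into learned subwords. The merge list is hand-picked for illustration, not learned from real data:

```python
# Illustrative only: a hand-picked merge list, not one learned from a corpus.
merges = [('l', 'i'), ('li', 'k'), ('lik', 'e')]  # gradually builds the token "like"

def apply_merges(word, merges):
    """Greedily apply BPE merges, in order, to a word's character list."""
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # fuse the pair into one token
            else:
                i += 1
    return tokens

print(apply_merges('liker', merges))  # ['like', 'r']
```

Even though "liker" was never seen as a whole word, it is represented by two known tokens instead of failing.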


🛠️ 2. The Step-by-Step Process

Let's walk through a manual iteration using a toy dataset.

# A string with many repetitions
text = 'like liker love lovely hug hugs hugging hearts'
 
# Initial vocabulary: All unique characters
chars = sorted(list(set(text)))
vocab = { char: i for i, char in enumerate(chars) }
 
# Text must be a list of tokens (initially characters)
tokens = list(text)
 
print(f"Initial sequence: {tokens[:10]}...")

Finding the Most Frequent Pair

We count every pair of adjacent tokens in our sequence.

token_pairs = {}
for i in range(len(tokens)-1):
    pair = tokens[i] + tokens[i+1]
    token_pairs[pair] = token_pairs.get(pair, 0) + 1
 
# Find the winner
most_freq_pair = max(token_pairs, key=token_pairs.get)
print(f'Most frequent pair: "{most_freq_pair}" ({token_pairs[most_freq_pair]} times)')

Output:

Most frequent pair: " h" (4 times)
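The same count can be written more compactly with collections.Counter from the standard library; this is a sketch equivalent to the loop above, with zip pairing each token with its successor:

```python
from collections import Counter

text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)

# Count every adjacent pair in one pass
token_pairs = Counter(a + b for a, b in zip(tokens, tokens[1:]))
most_freq_pair, count = token_pairs.most_common(1)[0]
print(f'Most frequent pair: "{most_freq_pair}" ({count} times)')
```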

Updating the Vocabulary

Now we add this new "merged" token to our vocabulary.

vocab[most_freq_pair] = max(vocab.values()) + 1
print(f"New Vocabulary size: {len(vocab)}")

🔄 3. Merging in the Text

Finally, we replace every instance of the character pair (' ' followed by 'h') with our new single token (' h').

new_text = []
i = 0
while i < (len(tokens) - 1):
    if (tokens[i] + tokens[i+1]) == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2 # Skip the next character
    else:
        new_text.append(tokens[i])
        i += 1
if i < len(tokens):
    new_text.append(tokens[i])
 
print(f"Original length: {len(tokens)}")
print(f"New length:      {len(new_text)}")

Output:

Original length: 46
New length:      42
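A quick sanity check, rebuilding the state from the steps above: joining the merged tokens must reproduce the original string, because merging only regroups characters and never drops any.

```python
text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)
most_freq_pair = ' h'  # the winner found in the previous step

new_text = []
i = 0
while i < len(tokens) - 1:
    if tokens[i] + tokens[i + 1] == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2  # skip the merged character
    else:
        new_text.append(tokens[i])
        i += 1
if i < len(tokens):
    new_text.append(tokens[i])

# Merging regroups characters without losing information
assert ''.join(new_text) == text
print(len(tokens), len(new_text))  # 46 42
```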

💡 What's Next?

In the next lesson, we'll automate this process with a loop to build a vocabulary of any size!

© 2026 Driptanil Datta. All rights reserved.


Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


Built with Love ❤️ | Last updated: Mar 16 2026