Your Algorithmic Rapper Name 🎤
Put your tokenization knowledge to the test. We'll use token IDs and some 'creative arithmetic' to generate a unique rapper name from your real name.
Can we use the structure of a tokenizer to create something creative? We'll use GPT-4's token IDs to generate a unique "rapper name" based on your name and favorite color.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. The Setup
We'll start by taking a first name, last name, and favorite color, then converting the whole string into tokens.
```python
import tiktoken
import numpy as np
import re

tokenizer = tiktoken.get_encoding('cl100k_base')

first_name = 'Mike X'
last_name = 'Cohen'
fav_color = 'purple'

text = ' '.join([first_name, last_name, fav_color])
token_ids = tokenizer.encode(text)
token_ids.sort()  # Sort for consistency
```

2. The "Rapper Name" Algorithm 🧮
We'll use a few rules to pick three tokens that will form the stage name. We want the results to look "normal" (letters, numbers, and spaces only), so we'll skip any weird special tokens.
```python
def is_normal(token_str):
    # Keep only tokens made of letters, digits, and whitespace
    return bool(re.match(r'^[A-Za-z0-9\s]+$', token_str))

# Part 1: token ID of the first letter of the first name, minus 1
p1 = tokenizer.encode(first_name[0])[0] - 1
while not is_normal(tokenizer.decode([p1])):
    p1 += 1

# Part 2: token ID of the last letter of the last name, plus 1
p2 = tokenizer.encode(last_name[-1])[0] + 1
while not is_normal(tokenizer.decode([p2])):
    p2 += 1

# Part 3: algorithmic mix (last ID / first ID, plus the mean of the middle IDs)
p3 = int(token_ids[-1] / token_ids[0] + np.mean(token_ids[1:-1]))
while not is_normal(tokenizer.decode([p3])):
    p3 += 1

rapper_name = tokenizer.decode([p1, p2, p3])
print(f"My rapper name is: '{rapper_name}'")
```

Output:

```
My rapper name is: 'Lo Limit'
```

3. The Logic Behind the Names
While this is a fun exercise, it highlights a few key properties of tokenizers:
- Token IDs are just integers: We can do math on them (even if the math is nonsensical like "Last ID / First ID").
- Vocabulary spans everything: The IDs cover single letters, common names, and abstract concepts.
- The 'Normal' Filter: Much of LLM development involves filtering out "garbage" tokens or non-printable characters to keep outputs clean.
💡 Summary
Tokenization isn't just for models; it's a way to view text as a mathematical sequence. Whether you're building a chatbot or generating a stage name, the tokenizer is your gateway to the world of numerical text representation.
In the next section, we'll move beyond BPE and look at BERT Tokenization and how it differs from GPT!