AI · Mar 2025 · 6 min read

Put your tokenization knowledge to the test. We'll use token IDs and some 'creative arithmetic' to generate a unique rapper name from your real name.

Your Algorithmic Rapper Name 🎤

Driptanil Datta · Software Developer


Can we use the structure of a tokenizer to create something creative? We'll use GPT-4's token IDs to generate a unique "rapper name" based on your name and favorite color.

🌍 References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms." It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Setup

We'll start by taking a first name, last name, and favorite color, then converting the whole string into tokens.

import tiktoken
import numpy as np
import re
 
tokenizer = tiktoken.get_encoding('cl100k_base')
 
first_name = 'Mike X'
last_name = 'Cohen'
fav_color = 'purple'
 
text = ' '.join([first_name, last_name, fav_color])
token_ids = tokenizer.encode(text)
token_ids.sort() # Sort for consistency

2. The "Rapper Name" Algorithm 🧮

We'll use a few rules to pick three tokens that will form the stage name. We want the results to be "normal" (letters and numbers only), so we'll skip any weird special tokens.

def is_normal(token_str):
  # True only for strings made entirely of letters, digits, and whitespace
  return bool(re.match(r'^[A-Za-z0-9\s]+$', token_str))
 
# Part 1: First letter of first name - 1
p1 = tokenizer.encode(first_name[0])[0] - 1
while not is_normal(tokenizer.decode([p1])): p1 += 1
 
# Part 2: Last letter of last name + 1
p2 = tokenizer.encode(last_name[-1])[0] + 1
while not is_normal(tokenizer.decode([p2])): p2 += 1
 
# Part 3: Algorithmic mix (Last/First + average of middle)
p3 = int(token_ids[-1]/token_ids[0] + np.mean(token_ids[1:-1]))
while not is_normal(tokenizer.decode([p3])): p3 += 1
 
rapper_name = tokenizer.decode([p1, p2, p3])
print(f"My rapper name is: '{rapper_name}'")
OUTPUT
My rapper name is: 'Lo Limit'
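
To make Part 3 concrete, here's the same arithmetic run in isolation on a hypothetical, already-sorted ID list (the IDs below are invented for illustration, not real cl100k_base output; `statistics.mean` stands in for `np.mean` to keep it dependency-free):

```python
from statistics import mean

token_ids = [220, 1234, 5678, 90000]  # invented sorted IDs, illustration only

# Largest ID divided by smallest, plus the mean of everything in between
p3 = int(token_ids[-1] / token_ids[0] + mean(token_ids[1:-1]))
print(p3)  # 3865
```

So 90000/220 ≈ 409.09 plus the middle average of 3456.0 lands on token 3865; from there the `is_normal` loop walks upward until it hits a "clean" token.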

3. The Logic Behind the Names

While this is a fun exercise, it highlights a few key properties of tokenizers:

  1. Token IDs are just integers: We can do math on them (even if the math is nonsensical like "Last ID / First ID").
  2. Vocabulary spans everything: The IDs cover single letters, common names, and abstract concepts.
  3. The 'Normal' Filter: Much of LLM development involves filtering out "garbage" tokens or non-printable characters to keep outputs clean.
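
The "normal" filter from point 3 is easy to poke at on its own. Here's a standalone sketch (the sample strings are made up) showing which strings pass:

```python
import re

def is_normal(token_str):
    # True only for strings made entirely of letters, digits, and whitespace
    return bool(re.match(r'^[A-Za-z0-9\s]+$', token_str))

for s in [' Lo', 'Limit', '42', '##', '\ufffd', '']:
    print(repr(s), '->', is_normal(s))
```

Tokens with a leading space (like ' Lo') still pass, which is why the decoded rapper name reads naturally, while punctuation-only or mojibake tokens get skipped.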

💡 Summary

Tokenization isn't just for models; it's a way to view text as a mathematical sequence. Whether you're building a chatbot or generating a stage name, the tokenizer is your gateway to the world of numerical text representation.

In the next section, we'll move beyond BPE and look at BERT tokenization and how it differs from GPT's approach!


Driptanil Datta

Software Developer

Building full-stack systems, one commit at a time. This blog is a centralized learning archive for developers.

Legal Notes
Disclaimer

The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP

Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


© 2026 Driptanil Datta. All rights reserved.