Your Algorithmic Rapper Name 🎤
Put your tokenization knowledge to the test. We'll use token IDs and some 'creative arithmetic' to generate a unique rapper name from your real name.
Can we use the structure of a tokenizer to create something creative? We'll use GPT-4's token IDs to generate a unique "rapper name" based on your name and favorite color.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. The Setup
We'll start by taking a first name, last name, and favorite color, then converting the whole string into tokens.
```python
import tiktoken
import numpy as np
import re

tokenizer = tiktoken.get_encoding('cl100k_base')

first_name = 'Mike X'
last_name = 'Cohen'
fav_color = 'purple'

text = ' '.join([first_name, last_name, fav_color])
token_ids = tokenizer.encode(text)
token_ids.sort()  # Sort for consistency
```

2. The "Rapper Name" Algorithm 🧮
We'll use a few rules to pick three tokens that will form the stage name. We want the results to look "normal" (letters, numbers, and spaces only), so we'll skip any weird special tokens.
```python
def is_normal(token_str):
    # Keep only tokens made of letters, digits, and whitespace
    return bool(re.match(r'^[A-Za-z0-9\s]+$', token_str))

# Part 1: token ID of the first letter of the first name, minus 1
p1 = tokenizer.encode(first_name[0])[0] - 1
while not is_normal(tokenizer.decode([p1])):
    p1 += 1

# Part 2: token ID of the last letter of the last name, plus 1
p2 = tokenizer.encode(last_name[-1])[0] + 1
while not is_normal(tokenizer.decode([p2])):
    p2 += 1

# Part 3: algorithmic mix (last ID / first ID, plus the mean of the middle IDs)
p3 = int(token_ids[-1] / token_ids[0] + np.mean(token_ids[1:-1]))
while not is_normal(tokenizer.decode([p3])):
    p3 += 1

rapper_name = tokenizer.decode([p1, p2, p3])
print(f"My rapper name is: '{rapper_name}'")
```

Output:

```
My rapper name is: 'Lo Limit'
```

3. The Logic Behind the Names
While this is a fun exercise, it highlights a few key properties of tokenizers:
- Token IDs are just integers: We can do math on them (even if the math is nonsensical like "Last ID / First ID").
- Vocabulary spans everything: The IDs cover single letters, common names, and abstract concepts.
- The 'Normal' Filter: Much of LLM development involves filtering out "garbage" tokens or non-printable characters to keep outputs clean.
💡 Summary
Tokenization isn't just for models; it's a way to view text as a mathematical sequence. Whether you're building a chatbot or generating a stage name, the tokenizer is your gateway to the world of numerical text representation.
In the next section, we'll move beyond BPE and look at BERT Tokenization and how it differs from GPT!