Why Models Can't Count 'r's in Strawberry 🍓
If you ask an LLM, "How many 'r's are in the word strawberry?", it often confidently answers "two." But there are clearly three. Is the model stupid? No—it's just "blind" to characters. This is the Strawberry Problem, and it's a direct consequence of tokenization.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. What the Model "Sees"
When we give the word "strawberry" to GPT-4, it doesn't see a sequence of 10 letters. It sees a sequence of 3 tokens.
```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Encode 'strawberry'
tokens = tokenizer.encode('strawberry')
for t in tokens:
    print(f"ID {t:5d} -> '{tokenizer.decode([t])}'")
```

Output:

```
ID   496 -> 'str'
ID   675 -> 'aw'
ID 15717 -> 'berry'
```

The model's "reality" is the IDs [496, 675, 15717].
2. The Counting Disconnect
Now, let's look for the letter 'r'. In GPT-4's vocabulary, the standalone letter 'r' is ID 81.
```python
r_token = tokenizer.encode('r')
print(f"ID for 'r': {r_token}")

# Is 'r' inside the 'strawberry' tokens?
print(f"Is 81 in [496, 675, 15717]? {81 in tokens}")
```

Output:

```
ID for 'r': [81]
Is 81 in [496, 675, 15717]? False
```

To the model, there are zero 'r' tokens in "strawberry." It only knows that the concept of a strawberry is composed of three specific subword chunks. It doesn't inherently know that the chunk 'berry' contains two 'r's unless it was specifically trained on character-level relationships.
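The gap between the two views can be sketched without any tokenizer at all. The subword split below is hard-coded from the cl100k_base output shown above, purely for illustration:

```python
# Subword chunks for 'strawberry', as produced above by cl100k_base
chunks = ['str', 'aw', 'berry']

# Token-level view (what the model "sees"): no chunk IS the letter 'r'
print('r' in chunks)               # False

# Character-level view: reassemble the string and count characters
print(''.join(chunks).count('r'))  # 3
```

Membership over chunks and counting over characters are simply different questions, and the model only ever gets asked the first one.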
3. How to Fix It (in code)
To get the correct answer, we have to force the model (or our script) to convert the tokens back into a character string before counting.
```python
# Decode the tokens back to a string
strawberry_str = tokenizer.decode(tokens)

# Count using standard string methods
count = strawberry_str.count('r')
print(f"Actual 'r' count: {count}")
```

Output:

```
Actual 'r' count: 3
```

💡 The Lesson
This isn't just about fruit. This limitation affects:
- Spelling: Models struggle with complex spelling tasks.
- Math: Numbers are often tokenized in chunks (e.g., "123" might be one token, while "1234" is two), leading to arithmetic errors.
- Code: Indentation and variable names are sensitive to how BPE merges them.
This is why "Chain of Thought" prompting helps. By asking the model to "spell the word out letter by letter first," you force it to generate character tokens, which makes the counting task trivial!
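The effect of that prompt can be mimicked in plain Python: once every letter stands alone, the count falls out immediately (a sketch independent of any tokenizer):

```python
word = 'strawberry'

# "Spell it out letter by letter" -- each character becomes its own unit,
# just as the prompt pushes the model to emit single-character tokens
letters = list(word)
print(' '.join(letters))   # s t r a w b e r r y

# Counting over individual characters is now trivial
print(letters.count('r'))  # 3
```

This is the same shift the decode step in section 3 performs: moving the problem from the token domain back into the character domain before counting.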
💡 Summary
LLMs operate on subwords, not characters. The "Strawberry Problem" is a perfect reminder that AI models don't perceive the world (or text) the same way humans do.