Why Models Can't Count 'r's in Strawberry 🍓
If you ask an LLM, "How many 'r's are in the word strawberry?", it often confidently answers "two." But there are clearly three. Is the model stupid? No—it's just "blind" to characters. This is the Strawberry Problem, and it's a direct consequence of tokenization.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. What the Model "Sees"
When we give the word "strawberry" to GPT-4, it doesn't see a sequence of 10 letters. It sees a sequence of 3 tokens.
```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Encode 'strawberry'
tokens = tokenizer.encode('strawberry')
for t in tokens:
    print(f"ID {t:5d} -> '{tokenizer.decode([t])}'")
```

Output:

```
ID   496 -> 'str'
ID   675 -> 'aw'
ID 15717 -> 'berry'
```

The model's "reality" is the IDs [496, 675, 15717].
2. The Counting Disconnect
Now, let's look for the letter 'r'. In GPT-4's vocabulary, the standalone letter 'r' is ID 81.
```python
r_token = tokenizer.encode('r')
print(f"ID for 'r': {r_token}")

# Is 'r' inside the 'strawberry' tokens?
print(f"Is 81 in [496, 675, 15717]? {81 in tokens}")
```

Output:

```
ID for 'r': [81]
Is 81 in [496, 675, 15717]? False
```

To the model, there are zero 'r' tokens in "strawberry." It only knows that the concept of a strawberry is composed of three specific subword chunks. It doesn't inherently know that the chunk 'berry' contains two 'r's unless it was specifically trained on character-level relationships.
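The gap between the two views can be sketched without any tokenizer at all. The subword split below is hard-coded from the cl100k_base output shown above, purely for illustration:

```python
# Subword chunks for 'strawberry', as produced above by cl100k_base
chunks = ['str', 'aw', 'berry']

# Token-level view (what the model "sees"): no chunk IS the letter 'r'
print('r' in chunks)               # False

# Character-level view: reassemble the string and count characters
print(''.join(chunks).count('r'))  # 3
```

Membership over chunks and counting over characters are simply different questions, and the model only ever gets asked the first one.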
3. How to Fix It (in code)
To get the correct answer, we have to force the model (or our script) to convert the tokens back into a character string before counting.
```python
# Decode the tokens back to a string
strawberry_str = tokenizer.decode(tokens)

# Count using standard string methods
count = strawberry_str.count('r')
print(f"Actual 'r' count: {count}")
```

Output:

```
Actual 'r' count: 3
```

💡 The Lesson
This isn't just about fruit. This limitation affects:
- Spelling: Models struggle with complex spelling tasks.
- Math: Numbers are often tokenized in chunks (e.g., "123" might be one token, while "1234" is two), leading to arithmetic errors.
- Code: Indentation and variable names are sensitive to how BPE merges them.
This is why "Chain of Thought" prompting helps. By asking the model to "spell the word out letter by letter first," you force it to generate character tokens, which makes the counting task trivial!
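The effect of that prompt can be mimicked in plain Python: once every letter stands alone, the count falls out immediately (a sketch independent of any tokenizer):

```python
word = 'strawberry'

# "Spell it out letter by letter" -- each character becomes its own unit,
# just as the prompt pushes the model to emit single-character tokens
letters = list(word)
print(' '.join(letters))   # s t r a w b e r r y

# Counting over individual characters is now trivial
print(letters.count('r'))  # 3
```

This is the same shift the decode step in section 3 performs: moving the problem from the token domain back into the character domain before counting.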
💡 Summary
LLMs operate on subwords, not characters. The "Strawberry Problem" is a perfect reminder that AI models don't perceive the world (or text) the same way humans do.