Explore the inner workings of Claude's tokenizer. How does Anthropic's 65k vocabulary compare to GPT-4's 100k, and why does it matter for your prompts?
Claude's Tokenizer: The Anthropic Approach 🎭
While OpenAI has standardized on cl100k_base, Anthropic uses a different vocabulary for Claude. By exploring Claude's tokenizer, we can see how different design choices affect token counts and model behavior.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. Vocabulary Size
The first major difference is the size of the "dictionary":
- GPT-4: ~100,000 tokens
- Claude: ~65,000 tokens
A smaller vocabulary means the model has fewer "mental slots" for words, which usually results in slightly higher token counts for the same text.
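To see the effect yourself, here is a minimal sketch that counts tokens for the same sentence under both vocabularies. It assumes the `tiktoken` and `transformers` packages are installed and uses the community-hosted "Xenova/claude-tokenizer" checkpoint (the same one used in the examples below); the sample sentence is just an illustration, and exact counts will vary with the text.

```python
# Minimal sketch: count tokens for the same text under both vocabularies.
# Assumes `tiktoken` and `transformers` are installed.
import tiktoken
from transformers import GPT2TokenizerFast

gpt4_enc = tiktoken.get_encoding("cl100k_base")                            # GPT-4, ~100k tokens
claude_tok = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")  # Claude, ~65k tokens

text = "Tokenization splits text into subword units before the model ever sees it."

print("GPT-4 :", len(gpt4_enc.encode(text)), "tokens")
print("Claude:", len(claude_tok.encode(text)), "tokens")
```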
2. Leading Whitespace Efficiency ⌨️
One of the most interesting features of modern BPE tokenizers is how they handle the space before a word. In Claude's tokenizer, a word and its "spaced" version are often different tokens.
```python
from transformers import GPT2TokenizerFast

# Load Claude's tokenizer (community-hosted copy on the Hugging Face Hub)
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/claude-tokenizer')

word1 = "hypothetical"
word2 = " hypothetical"  # Note the leading space

print(f"'{word1}': {tokenizer.encode(word1)}")
print(f"'{word2}': {tokenizer.encode(word2)}")
```

Output:

```
'hypothetical': [30678, 36881]  (2 tokens)
' hypothetical': [44086]  (1 token)
```

In this specific case, adding a space actually reduced the token count! The tokenizer has a dedicated token for " hypothetical" (space plus word), but no single token for the word on its own.
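If you want to check this behavior for other words, here is a quick sketch. The words below are arbitrary examples; the counts and IDs depend entirely on what happens to be in the vocabulary.

```python
# Compare token counts with and without a leading space for a few sample words.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")

for word in ["hypothetical", "tokenizer", "Anthropic"]:
    bare = tokenizer.encode(word)
    spaced = tokenizer.encode(" " + word)
    print(f"{word!r}: {len(bare)} tokens bare, {len(spaced)} tokens with a leading space")
```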
3. Tokenizing Code & Math 🧬
Tokenizers also vary in how they "chunk" technical symbols. Let's look at how Claude handles a Python slice:
```python
code = "targetActs[layeri,:,:,1]"
toks = tokenizer.encode(code)
print([tokenizer.decode(t) for t in toks])
```

Output:

```
['target', 'Acts', '[', 'layer', 'i', ',:,', ':,', '1', ']']
```

Notice how Claude has dedicated tokens for ",:," and ":,". This optimization makes it much more efficient at "reading" multidimensional array notation in libraries like NumPy or PyTorch.
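If you're curious how the same snippet splits under GPT-4's cl100k_base vocabulary, here is a hedged side-by-side sketch. It assumes `tiktoken` and `transformers` are installed, and the relative counts depend entirely on the snippet you pick.

```python
# Side-by-side: tokenize the same code snippet with Claude's tokenizer and
# GPT-4's cl100k_base, then compare how each one chunks the slice notation.
import tiktoken
from transformers import GPT2TokenizerFast

claude_tok = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")
gpt4_enc = tiktoken.get_encoding("cl100k_base")

snippet = "targetActs[layeri,:,:,1]"
claude_ids = claude_tok.encode(snippet)
gpt4_ids = gpt4_enc.encode(snippet)

print("Claude pieces:", [claude_tok.decode(t) for t in claude_ids])
print("GPT-4 pieces :", [gpt4_enc.decode([t]) for t in gpt4_ids])
print(f"Claude: {len(claude_ids)} tokens | GPT-4: {len(gpt4_ids)} tokens")
```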
💡 Summary: Why This Matters
Understanding these nuances helps you write more efficient prompts:
- Format Matters: Small changes in spacing or punctuation can change your token cost.
- Model Differences: A prompt that fits in Claude's context window might take up more (or less) space in GPT-4.
- Efficiency: Claude's 65k vocabulary is tuned for English text and common code patterns, making it a capable alternative to OpenAI's larger vocabularies despite its smaller size.
🎉 Section Complete!
You've made it through the entire first section of the AI & LLM Tokenization series! We've covered:
- The transition from Text to Numbers.
- The mechanics of Byte Pair Encoding (BPE).
- The "Strawberry Problem" and LLM limitations.
- Statistical laws like Zipf's Law.
- The differences between GPT, BERT, and Claude.
In the next section, we'll dive into Embeddings: how these token IDs are turned into high-dimensional vectors that represent human meaning!