Explore the inner workings of Claude's tokenizer. How does Anthropic's 65k vocabulary compare to GPT-4's 100k, and why does it matter for your prompts?
Claude's Tokenizer: The Anthropic Approach 🎭
While OpenAI has standardized on cl100k_base, Anthropic uses a different vocabulary for Claude. By exploring Claude's tokenizer, we can see how different design choices affect token counts and model behavior.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. Vocabulary Size
The first major difference is the size of the "dictionary":
- GPT-4: ~100,000 tokens
- Claude: ~65,000 tokens
A smaller vocabulary means the model has fewer "mental slots" for words, which usually results in slightly higher token counts for the same text.
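To see the effect yourself, here is a minimal sketch that counts tokens for the same sentence under both vocabularies. It assumes the `tiktoken` and `transformers` packages are installed and uses the community-hosted "Xenova/claude-tokenizer" checkpoint (the same one used in the examples below); the sample sentence is just an illustration, and exact counts will vary with the text.

```python
# Minimal sketch: count tokens for the same text under both vocabularies.
# Assumes `tiktoken` and `transformers` are installed.
import tiktoken
from transformers import GPT2TokenizerFast

gpt4_enc = tiktoken.get_encoding("cl100k_base")                            # GPT-4, ~100k tokens
claude_tok = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")  # Claude, ~65k tokens

text = "Tokenization splits text into subword units before the model ever sees it."

print("GPT-4 :", len(gpt4_enc.encode(text)), "tokens")
print("Claude:", len(claude_tok.encode(text)), "tokens")
```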
2. Leading Whitespace Efficiency ⌨️
One of the most interesting features of modern BPE tokenizers is how they handle the space before a word. In Claude's tokenizer, a word and its "spaced" version are often different tokens.
```python
from transformers import GPT2TokenizerFast

# Load Claude's tokenizer (community-hosted copy on the Hugging Face Hub)
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/claude-tokenizer')

word1 = "hypothetical"
word2 = " hypothetical"  # Note the leading space

print(f"'{word1}': {tokenizer.encode(word1)}")
print(f"'{word2}': {tokenizer.encode(word2)}")
```

Output:

```
'hypothetical': [30678, 36881]  (2 tokens)
' hypothetical': [44086]  (1 token)
```

In this specific case, adding a space actually reduced the token count! The tokenizer has a dedicated token for " hypothetical" (space plus word), but no single token for the word on its own.
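If you want to check this behavior for other words, here is a quick sketch. The words below are arbitrary examples; the counts and IDs depend entirely on what happens to be in the vocabulary.

```python
# Compare token counts with and without a leading space for a few sample words.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")

for word in ["hypothetical", "tokenizer", "Anthropic"]:
    bare = tokenizer.encode(word)
    spaced = tokenizer.encode(" " + word)
    print(f"{word!r}: {len(bare)} tokens bare, {len(spaced)} tokens with a leading space")
```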
3. Tokenizing Code & Math 🧬
Tokenizers also vary in how they "chunk" technical symbols. Let's look at how Claude handles a Python slice:
```python
code = "targetActs[layeri,:,:,1]"
toks = tokenizer.encode(code)
print([tokenizer.decode(t) for t in toks])
```

Output:

```
['target', 'Acts', '[', 'layer', 'i', ',:,', ':,', '1', ']']
```

Notice how Claude has dedicated tokens for ",:," and ":,". This optimization makes it much more efficient at "reading" multidimensional array notation in libraries like NumPy or PyTorch.
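If you're curious how the same snippet splits under GPT-4's cl100k_base vocabulary, here is a hedged side-by-side sketch. It assumes `tiktoken` and `transformers` are installed, and the relative counts depend entirely on the snippet you pick.

```python
# Side-by-side: tokenize the same code snippet with Claude's tokenizer and
# GPT-4's cl100k_base, then compare how each one chunks the slice notation.
import tiktoken
from transformers import GPT2TokenizerFast

claude_tok = GPT2TokenizerFast.from_pretrained("Xenova/claude-tokenizer")
gpt4_enc = tiktoken.get_encoding("cl100k_base")

snippet = "targetActs[layeri,:,:,1]"
claude_ids = claude_tok.encode(snippet)
gpt4_ids = gpt4_enc.encode(snippet)

print("Claude pieces:", [claude_tok.decode(t) for t in claude_ids])
print("GPT-4 pieces :", [gpt4_enc.decode([t]) for t in gpt4_ids])
print(f"Claude: {len(claude_ids)} tokens | GPT-4: {len(gpt4_ids)} tokens")
```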
💡 Summary: Why This Matters
Understanding these nuances helps you write more efficient prompts:
- Format Matters: Small changes in spacing or punctuation can change your token cost.
- Model Differences: A prompt that fits in Claude's context window might take up more (or less) space in GPT-4.
- Efficiency: Claude's 65k vocabulary is tuned for English text and common code patterns, making it a capable alternative to OpenAI's larger vocabularies despite its smaller size.
🎉 Section Complete!
You've made it through the entire first section of the AI & LLM Tokenization series! We've covered:
- The transition from Text to Numbers.
- The mechanics of Byte Pair Encoding (BPE).
- The "Strawberry Problem" and LLM limitations.
- Statistical laws like Zipf's Law.
- The differences between GPT, BERT, and Claude.
In the next section, we'll dive into Embeddings: how these token IDs are turned into high-dimensional vectors that represent human meaning!