How well does your tokenizer compress information? We compare the compression ratios of classic literature, modern websites, and raw symbols.
Token Compression: Efficiency Analysis 📉
BPE is essentially a form of lossless compression: it tries to represent as much text as possible with as few tokens as possible. But how efficient is it across different domains? We measure the Compression Ratio (Tokens / Characters) for various types of data; the lower the ratio, the better the compression.
1. The Experiment
We'll take three types of data:
- Classic Books (Prose)
- Modern Websites (HTML/JSON/Text)
- String Constants (Letters/Digits/Symbols)
We'll use GPT-4's cl100k_base tokenizer for all tests.
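The original preprocessing isn't specified, so treat the numbers below as illustrative. A minimal sketch of the measurement itself:

```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def compression_ratio(text: str) -> float:
    """Tokens per character: lower means better compression."""
    return len(tokenizer.encode(text)) / len(text)

print(f"{compression_ratio('To be, or not to be, that is the question.'):.2%}")
```

Every figure in the tables below is this ratio expressed as a percentage.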
2. Results: Classic Literature 📚
English prose is highly predictable, making it very "compressible" for BPE.
| Text | Chars | Tokens | Compression Ratio |
|---|---|---|---|
| Frankenstein | 446,583 | 102,397 | 22.93% |
| Romeo & Juliet | 167,470 | 43,738 | 26.12% |
| Edgar Allan Poe | 632,177 | 144,292 | 22.82% |
On average, English books require only ~0.23 tokens per character.
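To reproduce a row yourself, you can pull a public-domain text and measure it. A sketch using Project Gutenberg's Frankenstein (the exact URL and edition are assumptions, and Gutenberg's boilerplate headers will shift the numbers slightly):

```python
import urllib.request
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Plain-text Frankenstein from Project Gutenberg (URL assumed; any edition works)
url = "https://www.gutenberg.org/cache/epub/84/pg84.txt"
text = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

tokens = tokenizer.encode(text)
print(f"{len(tokens):,} tokens / {len(text):,} chars = {len(tokens) / len(text):.2%}")
```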
3. Results: Modern Websites 🌐
Websites are often "heavier" to tokenize because they mix URLs, markup, technical jargon, and symbols.
| Website | Compression Ratio |
|---|---|
| python.org | 26.32% |
| sudoku.com | 36.31% |
| openai.com | 52.38% |
Why is openai.com at 52%? Much of its content is technical documentation and specific product names, which are represented far less compactly in the standard vocabulary than common English words.
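A sketch of how such a measurement might be taken on raw HTML (the original preprocessing isn't specified; measuring extracted visible text instead would give different numbers):

```python
import urllib.request
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Fetch raw HTML; the page served today will differ from the snapshot measured above
req = urllib.request.Request("https://www.python.org", headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="ignore")

# disallowed_special=() tells encode() to treat any special-token text in the page as plain text
tokens = tokenizer.encode(html, disallowed_special=())
print(f"{len(tokens):,} tokens / {len(html):,} chars = {len(tokens) / len(html):.2%}")
```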
4. Results: Raw Symbols ⌨️
This is where the tokenizer's efficiency breaks down.
```python
import string
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def report(label: str, text: str) -> None:
    """Print the token count and tokens-per-character ratio for a string."""
    n = len(tokenizer.encode(text))
    print(f"{label}: {n} token(s) for {len(text)} chars ({n / len(text):.1%})")

report("Lower Alphabet", string.ascii_lowercase)  # a-z
report("Digits", string.digits)                   # 0-9
report("Punctuation", string.punctuation)         # all 32 ASCII punctuation marks
```

Output:

```
Lower Alphabet: 1 token(s) for 26 chars (3.8%)
Digits: 4 token(s) for 10 chars (40.0%)
Punctuation: 21 token(s) for 32 chars (65.6%)
```

💡 Key Findings
- Alphabet Density: cl100k_base has a single token for the entire lowercase alphabet string! That's massive compression (3.8%).
- Punctuation Penalty: Punctuation is "expensive": it takes 21 tokens to represent just 32 symbols (65.6%). This is why heavy formatting in prompts can quickly eat up your token limit; see the sketch after this list.
- Domain Bias: BPE is tuned for natural language. The more your text looks like a "dictionary," the better the compression. The more it looks like "random noise" (or code symbols), the worse it gets.
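A quick sketch of that formatting penalty (the example prompts are made up; exact counts depend on the text):

```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Hypothetical prompts: the same request, with and without decorative formatting
plain = "Summarize the quarterly report and list three risks."
formatted = "### TASK ###\n>>> Summarize the *quarterly* report!!! <<<\n--- List (3) risks: ---"

for label, text in [("Plain", plain), ("Formatted", formatted)]:
    n = len(tokenizer.encode(text))
    print(f"{label}: {n} tokens for {len(text)} chars ({n / len(text):.1%})")
```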
💡 Summary
Compression ratio isn't just a technical metric: it's a proxy for how familiar the model is with the patterns in your data. A low ratio means the tokenizer is matching long, familiar subword patterns; a ratio approaching 1:1 (one token per character) means it's struggling to find subword groups it recognizes.
Next, we'll look at how tokenization varies across different human languages!