How well does your tokenizer compress information? We compare the compression ratios of classic literature, modern websites, and raw symbols.
Token Compression: Efficiency Analysis 📉
BPE is essentially a form of lossless compression: it tries to represent as much text as possible with as few tokens as possible. But how efficient is it across different domains? We measure the Compression Ratio (Tokens / Characters) for various types of data; the lower the ratio, the better the compression.
1. The Experiment
We'll take three types of data:
- Classic Books (Prose)
- Modern Websites (HTML/JSON/Text)
- String Constants (Letters/Digits/Symbols)
We'll use GPT-4's cl100k_base tokenizer for all tests.
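The original preprocessing isn't specified, so treat the numbers below as illustrative. A minimal sketch of the measurement itself:

```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def compression_ratio(text: str) -> float:
    """Tokens per character: lower means better compression."""
    return len(tokenizer.encode(text)) / len(text)

print(f"{compression_ratio('To be, or not to be, that is the question.'):.2%}")
```

Every figure in the tables below is this ratio expressed as a percentage.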
2. Results: Classic Literature 📚
English prose is highly predictable, making it very "compressible" for BPE.
| Text | Chars | Tokens | Compression Ratio |
|---|---|---|---|
| Frankenstein | 446,583 | 102,397 | 22.93% |
| Romeo & Juliet | 167,470 | 43,738 | 26.12% |
| Edgar Allan Poe | 632,177 | 144,292 | 22.82% |
On average, English books require only ~0.23 tokens per character.
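To reproduce a row yourself, you can pull a public-domain text and measure it. A sketch using Project Gutenberg's Frankenstein (the exact URL and edition are assumptions, and Gutenberg's boilerplate headers will shift the numbers slightly):

```python
import urllib.request
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Plain-text Frankenstein from Project Gutenberg (URL assumed; any edition works)
url = "https://www.gutenberg.org/cache/epub/84/pg84.txt"
text = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

tokens = tokenizer.encode(text)
print(f"{len(tokens):,} tokens / {len(text):,} chars = {len(tokens) / len(text):.2%}")
```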
3. Results: Modern Websites 🌐
Websites are often "heavier" to tokenize because they mix URLs, markup, technical jargon, and symbols.
| Website | Compression Ratio |
|---|---|
| python.org | 26.32% |
| sudoku.com | 36.31% |
| openai.com | 52.38% |
Why is openai.com at 52%? Much of its content is technical documentation and specific product names, which are represented far less compactly in the standard vocabulary than common English words.
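A sketch of how such a measurement might be taken on raw HTML (the original preprocessing isn't specified; measuring extracted visible text instead would give different numbers):

```python
import urllib.request
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Fetch raw HTML; the page served today will differ from the snapshot measured above
req = urllib.request.Request("https://www.python.org", headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="ignore")

# disallowed_special=() tells encode() to treat any special-token text in the page as plain text
tokens = tokenizer.encode(html, disallowed_special=())
print(f"{len(tokens):,} tokens / {len(html):,} chars = {len(tokens) / len(html):.2%}")
```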
4. Results: Raw Symbols ⌨️
This is where the tokenizer's efficiency breaks down.
```python
import string
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def report(label: str, text: str) -> None:
    """Print the token count and tokens-per-character ratio for a string."""
    n = len(tokenizer.encode(text))
    print(f"{label}: {n} token(s) for {len(text)} chars ({n / len(text):.1%})")

report("Lower Alphabet", string.ascii_lowercase)  # a-z
report("Digits", string.digits)                   # 0-9
report("Punctuation", string.punctuation)         # all 32 ASCII punctuation marks
```

Output:

```
Lower Alphabet: 1 token(s) for 26 chars (3.8%)
Digits: 4 token(s) for 10 chars (40.0%)
Punctuation: 21 token(s) for 32 chars (65.6%)
```

💡 Key Findings
- Alphabet Density: cl100k_base has a single token for the entire lowercase alphabet string! That's massive compression (3.8%).
- Punctuation Penalty: Punctuation is "expensive": it takes 21 tokens to represent just 32 symbols (65.6%). This is why heavy formatting in prompts can quickly eat up your token limit; see the sketch after this list.
- Domain Bias: BPE is tuned for natural language. The more your text looks like a "dictionary," the better the compression. The more it looks like "random noise" (or code symbols), the worse it gets.
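A quick sketch of that formatting penalty (the example prompts are made up; exact counts depend on the text):

```python
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Hypothetical prompts: the same request, with and without decorative formatting
plain = "Summarize the quarterly report and list three risks."
formatted = "### TASK ###\n>>> Summarize the *quarterly* report!!! <<<\n--- List (3) risks: ---"

for label, text in [("Plain", plain), ("Formatted", formatted)]:
    n = len(tokenizer.encode(text))
    print(f"{label}: {n} tokens for {len(text)} chars ({n / len(text):.1%})")
```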
💡 Summary
Compression ratio isn't just a technical metric: it's a proxy for how familiar the model is with the patterns in your data. A low ratio means the tokenizer is matching long, familiar subword patterns; a ratio approaching 1:1 (one token per character) means it's struggling to find subword groups it recognizes.
Next, we'll look at how tokenization varies across different human languages!