AI · Mar 2025 · 9 min read

How well does your tokenizer compress information? We compare the compression ratios of classic literature, modern websites, and raw code.

Token Compression: Efficiency Analysis 📉

Driptanil Datta, Software Developer


BPE is essentially a form of lossless compression: it tries to represent as much text as possible with the fewest tokens. But how efficient is it across different domains? We measure the Compression Ratio (Tokens / Characters) for various types of data.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Experiment

We'll take three types of data:

  1. Classic Books (Prose)
  2. Modern Websites (HTML/JSON/Text)
  3. String Constants (Letters/Digits/Symbols)

We'll use GPT-4's cl100k_base tokenizer for all tests.
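
To make the numbers below easy to reproduce, here is a minimal sketch of the measurement, assuming the tiktoken library is installed; the compression_ratio helper is our own name, not part of any library.

import tiktoken

# Load the cl100k_base vocabulary used by GPT-4
tokenizer = tiktoken.get_encoding('cl100k_base')

def compression_ratio(text: str) -> float:
    # Tokens per character: lower means the text compresses better
    return len(tokenizer.encode(text)) / len(text)

print(f"{compression_ratio('It was a dark and stormy night.'):.2%}")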


2. Results: Classic Literature 📚

English prose is highly predictable, making it very "compressible" for BPE.

Book Title        Chars     Tokens    Compression (Tokens / Chars)
Frankenstein      446,583   102,397   22.93%
Romeo & Juliet    167,470    43,738   26.12%
Edgar Allan Poe   632,177   144,292   22.82%

On average, English books require only ~0.23 tokens per character.
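
If you want to reproduce these figures, a sketch like the one below works, assuming you have saved a book's plain text locally (frankenstein.txt is just a placeholder path).

import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Hypothetical local copy of the book's plain text
with open('frankenstein.txt', encoding='utf-8') as f:
    text = f.read()

chars = len(text)
tokens = len(tokenizer.encode(text))
print(f"Chars: {chars:,} | Tokens: {tokens:,} | Compression: {tokens / chars:.2%}")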


3. Results: Modern Websites 🌐

Websites are often "heavier" because they contain URLs, technical jargon, and mixed symbols.

Website       Compression Ratio (Tokens / Chars)
python.org    26.32%
sudoku.com    36.31%
openai.com    52.38%

Why is OpenAI's site at 52%? Much of its content consists of technical documentation and specific product names that are not well represented in the standard vocabulary, so they split into more tokens than common English words do.
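
You can check a site yourself with the standard library's urllib, as in the sketch below; we measure the raw HTML here, so your exact percentage will differ depending on what the page serves at fetch time.

import urllib.request
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Fetch the raw HTML of the homepage (contents change over time)
html = urllib.request.urlopen('https://www.python.org').read().decode('utf-8')

# disallowed_special=() keeps stray '<|...|>' strings in the HTML from raising an error
tokens = len(tokenizer.encode(html, disallowed_special=()))
print(f"python.org: {tokens:,} tokens for {len(html):,} chars ({tokens / len(html):.2%})")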


4. Results: Raw Symbols ⌨️

This is where the tokenizer's efficiency breaks down.

import string
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# Lowercase alphabet: "abcdefghijklmnopqrstuvwxyz"
text = string.ascii_lowercase
tokens = tokenizer.encode(text)
print(f"Lower Alphabet: {len(tokens)} token(s) for {len(text)} chars ({len(tokens) / len(text):.1%})")

# Digits: "0123456789"
text = string.digits
tokens = tokenizer.encode(text)
print(f"Digits: {len(tokens)} tokens for {len(text)} chars ({len(tokens) / len(text):.1%})")

# Punctuation: all 32 ASCII punctuation symbols
text = string.punctuation
tokens = tokenizer.encode(text)
print(f"Punctuation: {len(tokens)} tokens for {len(text)} chars ({len(tokens) / len(text):.1%})")
OUTPUT
Lower Alphabet: 1 token(s) for 26 chars (3.8%)
Digits: 4 tokens for 10 chars (40.0%)
Punctuation: 21 tokens for 32 chars (65.6%)

💡 Key Findings

  1. Alphabet Density: GPT-4 has a single token for the entire lowercase alphabet string! That's massive compression (3.8%).
  2. Punctuation Penalty: Punctuation is "expensive." It takes 21 tokens to represent just 32 symbols (65.6%). This is why heavy formatting in prompts can quickly eat up your token limit (a quick comparison follows this list).
  3. Domain Bias: BPE is tuned for natural language. The more your text looks like a "dictionary," the better the compression. The more it looks like "random noise" (or code symbols), the worse it gets.
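
To illustrate finding 2, here is a quick sketch comparing a plain sentence with the same information wrapped in JSON-style formatting; the exact counts depend on the strings you choose, but the formatted version typically needs more tokens per character.

import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

prose = "The user Alice is 30 years old and lives in Paris."
formatted = '{"user": {"name": "Alice", "age": 30, "city": "Paris"}}'

for label, text in [("Prose", prose), ("JSON", formatted)]:
    tokens = len(tokenizer.encode(text))
    print(f"{label}: {tokens} tokens for {len(text)} chars ({tokens / len(text):.1%})")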

💡 Summary

Compression ratio isn't just a technical metric; it's a proxy for how much the model "understands" the patterns in your data. A low ratio (few tokens per character) means the model is seeing familiar patterns; a ratio approaching 1:1 (one token per character) means it is struggling to find subword groups it recognizes.

Next, we'll look at how tokenization varies across different human languages!
