Mar 2025 · 9 min read

How much information fits into a single token? We analyze the relationship between word length and token count in GPT-4.

Token Efficiency: Words vs. Tokens ⚖️

Driptanil Datta · Software Developer


A common rule of thumb is that 1,000 tokens ≈ 750 words. But how true is that? Does it change for short words vs. long words? We'll use H.G. Wells' The Time Machine to measure the "efficiency" of the GPT-4 tokenizer.
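
As a quick sanity check of that rule of thumb, here is a minimal sketch; the sample text is the novel's opening line, and exact counts depend on the tokenizer, so treat the ratio as approximate.

import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')  # the GPT-4 tokenizer

sample = ("The Time Traveller (for so it will be convenient to speak of him) "
          "was expounding a recondite matter to us.")
n_words = len(sample.split())
n_tokens = len(tokenizer.encode(sample))
print(n_words, n_tokens, round(n_words / n_tokens, 2))  # expect roughly 0.75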

🌍
References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Experiment

We'll take every word in the book, count its characters, and then see how many tokens GPT-4 uses to represent it.

import requests
import re
import numpy as np
import tiktoken
 
# Fetch text and setup tokenizer
text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text
tokenizer = tiktoken.get_encoding('cl100k_base')
 
# Split into words and punctuation
words = re.split(r'([,.:;—?_!"“”()\']|--|\s)', text)
words = [w.strip() for w in words if w.strip()]
 
# Store (char_count, token_count) for each word
stats = np.zeros((len(words), 2), dtype=int)
 
for i, w in enumerate(words):
  stats[i, 0] = len(w)               # Word length (chars)
  stats[i, 1] = len(tokenizer.encode(w)) # Token count
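
Before plotting, it's worth spot-checking a few entries. This short sketch reuses the words and stats arrays defined above:

# Peek at the first few (word, chars, tokens) triples
for w, (chars, toks) in zip(words[:5], stats[:5]):
  print(f'{w!r}: {chars} chars -> {toks} token(s)')

# Aggregate view: how many characters does one token carry on average?
print('chars per token:', stats[:, 0].sum() / stats[:, 1].sum())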

2. Visualizing Information Density 📈

If we plot these, we can see how "compressed" our language becomes. We'll add some random jitter to the points so they don't all overlap.

import matplotlib.pyplot as plt
 
plt.figure(figsize=(12, 5))
 
# Add jitter for visibility
jx = np.random.randn(len(words)) / 20
jy = np.random.randn(len(words)) / 20
 
plt.scatter(stats[:, 0] + jx, stats[:, 1] + jy, color='black', s=1, alpha=0.3)
plt.xlabel('Word Length (Characters)')
plt.ylabel('Number of Tokens')
plt.title('Token Efficiency: Word Length vs. Token Count')
plt.xticks(range(1, 21))
plt.grid(alpha=0.2)
plt.show()
[Plot: word length (characters) vs. number of tokens, one jittered point per word]
The plot reveals that words up to 6-8 characters are almost always represented by a single token. Efficiency drops (tokens increase) as words get longer or more complex.
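
We can put a number on that claim using the stats array we already built; the length cutoffs below simply mirror what the plot suggests.

# Share of words at each length that cost exactly one token
for length in range(1, 13):
  mask = stats[:, 0] == length
  if mask.any():
    share = (stats[mask, 1] == 1).mean()
    print(f'{length:2d} chars: {share:.0%} single-token')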

3. Key Findings

  1. The "Sweet Spot": For English text, words between 2 and 8 characters are highly efficient, usually costing exactly 1 token.
  2. Diminishing Returns: Once a word exceeds ~10 characters (like experimental or crystallized), GPT-4 begins breaking it into multiple subword tokens.
  3. The Multiplier: This is consistent with the "0.75 words per token" estimate; a quick check follows this list. Short, common words are "cheap," while long, technical words are "expensive."
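
Here's that check, plus a look at how the long words from finding 2 split into subword pieces. It's a small sketch reusing words, stats, and tokenizer from Section 1; note that punctuation marks count as "words" in our list, so the ratio is only approximate.

# Rough words-per-token ratio over the whole book
print('words per token:', len(words) / stats[:, 1].sum())

# Peek at how long words break into subword tokens
for w in ['the', 'experimental', 'crystallized']:
  ids = tokenizer.encode(w)
  print(f'{w}: {len(ids)} token(s) ->', [tokenizer.decode([i]) for i in ids])
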
⚠️

Why does this matter? LLM providers charge by the token, not the word. Understanding token efficiency helps you estimate costs and optimize prompts—sometimes a shorter word is literally cheaper!
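
For example, here is a back-of-the-envelope input-cost estimator. It's a sketch only: the price per 1,000 tokens is a placeholder, not any provider's current rate, and it reuses the tokenizer from Section 1.

def estimate_input_cost(prompt, usd_per_1k_tokens=0.03):  # placeholder price
  n_tokens = len(tokenizer.encode(prompt))
  return n_tokens / 1000 * usd_per_1k_tokens

print(f"${estimate_input_cost('Summarize The Time Machine in one paragraph.'):.5f}")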


💡 Summary

  • Density: Byte-pair encoding (BPE) allows common words to be extremely dense (1 token = many chars).
  • Granularity: Rare words are less dense (1 token = few chars).
  • Cost: Your bill is determined by the tokenizer's ability to find patterns in your specific text.




© 2026 Driptanil Datta. All rights reserved.