Mar 2025 · 9 min read

How much information fits into a single token? We analyze the relationship between word length and token count in GPT-4.

Token Efficiency: Words vs. Tokens ⚖️

Driptanil Datta · Software Developer


A common rule of thumb is that 1,000 tokens ≈ 750 words. But how true is that? Does it change for short words vs. long words? We'll use H.G. Wells' The Time Machine to measure the "efficiency" of the GPT-4 tokenizer.
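
As a quick sanity check of that rule of thumb, here is a minimal sketch; the sample text is the novel's opening line, and exact counts depend on the tokenizer, so treat the ratio as approximate.

import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')  # the GPT-4 tokenizer

sample = ("The Time Traveller (for so it will be convenient to speak of him) "
          "was expounding a recondite matter to us.")
n_words = len(sample.split())
n_tokens = len(tokenizer.encode(sample))
print(n_words, n_tokens, round(n_words / n_tokens, 2))  # expect roughly 0.75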

🌍
References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Experiment

We'll take every word in the book, count its characters, and then see how many tokens GPT-4 uses to represent it.

import requests
import re
import numpy as np
import tiktoken
 
# Fetch text and setup tokenizer
text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text
tokenizer = tiktoken.get_encoding('cl100k_base')
 
# Split into words and punctuation
words = re.split(r'([,.:;—?_!"“”()\']|--|\s)', text)
words = [w.strip() for w in words if w.strip()]
 
# Store (char_count, token_count) for each word
stats = np.zeros((len(words), 2), dtype=int)
 
for i, w in enumerate(words):
  stats[i, 0] = len(w)               # Word length (chars)
  stats[i, 1] = len(tokenizer.encode(w)) # Token count
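
Before plotting, it's worth spot-checking a few entries. This short sketch reuses the words and stats arrays defined above:

# Peek at the first few (word, chars, tokens) triples
for w, (chars, toks) in zip(words[:5], stats[:5]):
  print(f'{w!r}: {chars} chars -> {toks} token(s)')

# Aggregate view: how many characters does one token carry on average?
print('chars per token:', stats[:, 0].sum() / stats[:, 1].sum())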

2. Visualizing Information Density 📈

If we plot these, we can see how "compressed" our language becomes. We'll add some random jitter to the points so they don't all overlap.

import matplotlib.pyplot as plt
 
plt.figure(figsize=(12, 5))
 
# Add jitter for visibility
jx = np.random.randn(len(words)) / 20
jy = np.random.randn(len(words)) / 20
 
plt.scatter(stats[:, 0] + jx, stats[:, 1] + jy, color='black', s=1, alpha=0.3)
plt.xlabel('Word Length (Characters)')
plt.ylabel('Number of Tokens')
plt.title('Token Efficiency: Word Length vs. Token Count')
plt.xticks(range(1, 21))
plt.grid(alpha=0.2)
plt.show()
[Plot: word length (characters) vs. number of tokens, one jittered point per word]
The plot reveals that words up to 6-8 characters are almost always represented by a single token. Efficiency drops (tokens increase) as words get longer or more complex.
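
We can put a number on that claim using the stats array we already built; the length cutoffs below simply mirror what the plot suggests.

# Share of words at each length that cost exactly one token
for length in range(1, 13):
  mask = stats[:, 0] == length
  if mask.any():
    share = (stats[mask, 1] == 1).mean()
    print(f'{length:2d} chars: {share:.0%} single-token')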

3. Key Findings

  1. The "Sweet Spot": For English text, words between 2 and 8 characters are highly efficient, usually costing exactly 1 token.
  2. Diminishing Returns: Once a word exceeds ~10 characters (like experimental or crystallized), GPT-4 begins breaking it into multiple subword tokens.
  3. The Multiplier: This is consistent with the "0.75 words per token" estimate; a quick check follows this list. Short, common words are "cheap," while long, technical words are "expensive."
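
Here's that check, plus a look at how the long words from finding 2 split into subword pieces. It's a small sketch reusing words, stats, and tokenizer from Section 1; note that punctuation marks count as "words" in our list, so the ratio is only approximate.

# Rough words-per-token ratio over the whole book
print('words per token:', len(words) / stats[:, 1].sum())

# Peek at how long words break into subword tokens
for w in ['the', 'experimental', 'crystallized']:
  ids = tokenizer.encode(w)
  print(f'{w}: {len(ids)} token(s) ->', [tokenizer.decode([i]) for i in ids])
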
⚠️

Why does this matter? LLM providers charge by the token, not the word. Understanding token efficiency helps you estimate costs and optimize prompts—sometimes a shorter word is literally cheaper!
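
For example, here is a back-of-the-envelope input-cost estimator. It's a sketch only: the price per 1,000 tokens is a placeholder, not any provider's current rate, and it reuses the tokenizer from Section 1.

def estimate_input_cost(prompt, usd_per_1k_tokens=0.03):  # placeholder price
  n_tokens = len(tokenizer.encode(prompt))
  return n_tokens / 1000 * usd_per_1k_tokens

print(f"${estimate_input_cost('Summarize The Time Machine in one paragraph.'):.5f}")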


💡 Summary

  • Density: Byte-pair encoding (BPE) allows common words to be extremely dense (1 token = many chars).
  • Granularity: Rare words are less dense (1 token = few chars).
  • Cost: Your bill is determined by the tokenizer's ability to find patterns in your specific text.




© 2026 Driptanil Datta. All rights reserved.