Is AI more expensive for non-English speakers? We compare tokenization efficiency across Spanish, Arabic, Chinese, and more.
Tokenization Across Languages: The Fairness Gap 🌎
Most popular LLMs are trained primarily on English data. This creates a hidden disparity: the same concept might cost 1 token in English but 10 tokens in another language. This is often called the "Token Tax."
1. The Multi-Language Experiment
We translated the same sentence about "blue towels" into several languages and measured the resulting token counts with both the BERT and GPT-4 tokenizers.
| Language | Chars | BERT Tokens | GPT-4 Tokens |
|---|---|---|---|
| English | 123 | 26 | 26 |
| Spanish | 132 | 46 | 34 |
| Arabic | 115 | 95 | 84 |
| Chinese | 39 | 39 | 55 |
| Tamil | 154 | 35 | 209 |
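
You can reproduce a comparison like this yourself by counting tokens directly. Here is a minimal sketch; the sentences and the specific model checkpoints are illustrative assumptions, not the article's exact test data:

```python
# pip install transformers tiktoken
from transformers import AutoTokenizer
import tiktoken

# Hypothetical stand-ins for the "blue towels" sentence; the original
# test sentences are not reproduced in this post.
sentences = {
    "English": "The blue towels were hanging on the line, drying slowly in the sun.",
    "Spanish": "Las toallas azules colgaban de la cuerda, secándose lentamente al sol.",
}

# Model choices are assumptions: the post says "BERT" and "GPT-4" without
# naming exact tokenizers, so we use a common BERT checkpoint and tiktoken.
bert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
gpt4 = tiktoken.encoding_for_model("gpt-4")

for lang, text in sentences.items():
    print(f"{lang}: {len(text)} chars | "
          f"BERT {len(bert.tokenize(text))} tokens | "
          f"GPT-4 {len(gpt4.encode(text))} tokens")
```

Your exact counts will differ from the table unless you use the same sentences and tokenizer versions, but the cross-language gap should be clearly visible.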

2. Key Observations 🧐
The "English Bias"
In English, GPT-4 is extremely efficient (123 characters → 26 tokens). In Tamil, however, 154 characters explode into 209 tokens. This means a Tamil speaker might pay roughly 8 times more to process the same information as an English speaker!
Roman vs. Non-Roman
Languages that use the Roman alphabet (Spanish, Esperanto) benefit from English-heavy training data because many subwords (like prefixes and suffixes) are shared. Non-Roman scripts (Arabic, Tamil) are often tokenized at the character level or even the byte level, which is far less efficient.
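
You can see this fragmentation directly by encoding single words in each script. A small sketch using tiktoken; the word choices below are illustrative picks, not the article's test data:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# "Towel" in a few scripts. Non-Roman scripts tend to fall back to
# byte-level pieces, producing many more tokens per character.
for word in ["towel", "toalla", "منشفة", "毛巾", "துண்டு"]:
    ids = enc.encode(word)
    print(f"{word}: {len(word)} chars -> {len(ids)} tokens")
```

A five-letter English word is typically a single token, while a word of similar length in Arabic or Tamil can splinter into several byte-level tokens.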
The Chinese Exception
Chinese is interesting. While it needs more tokens than English for the same meaning, it uses very few characters (39 characters → 55 GPT-4 tokens). Because each character carries a lot of meaning, the "meaning per token" is actually quite high, even if the "tokens per character" ratio looks bad.
3. Why It Matters
- Cost: API usage is charged by token. Higher token counts = higher bills (see the back-of-the-envelope sketch after this list).
- Context Window: Models have a fixed token limit (e.g., 128k). If your language is inefficiently tokenized, you can fit less "story" or "data" into the model's memory.
- Performance: Models often perform slightly worse on languages where they have to think in smaller, less meaningful fragments.
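
Here is that back-of-the-envelope cost sketch, using the GPT-4 counts from the table above. The per-token price is a hypothetical placeholder, not a real quote:

```python
# "Token tax" multiplier per language, relative to English.
PRICE_PER_1K_TOKENS = 0.01  # hypothetical USD per 1,000 tokens

gpt4_tokens = {"English": 26, "Spanish": 34, "Arabic": 84, "Chinese": 55, "Tamil": 209}

baseline = gpt4_tokens["English"]
for lang, n in gpt4_tokens.items():
    cost = n / 1000 * PRICE_PER_1K_TOKENS
    print(f"{lang:8s}: {n:3d} tokens  ~${cost:.5f}  ({n / baseline:.1f}x English)")
```

The same multiplier applies to the context window: at 8× the tokens, a Tamil document effectively shrinks a 128k-token window to the equivalent of about 16k tokens of English.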
Pro Tip: If you are building an app for a specific non-English market, check the tokenization efficiency of your chosen model first. Some models (like Llama-3) have significantly better multilingual vocabularies than others!
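
A quick way to run that check is to encode a representative sentence from your target market with each candidate tokenizer. A sketch using Hugging Face's AutoTokenizer; the model IDs and sample sentence are assumptions (note that Meta's Llama-3 checkpoints are gated and require accepting the license first):

```python
from transformers import AutoTokenizer

MODELS = ["bert-base-multilingual-cased", "meta-llama/Meta-Llama-3-8B"]
TEXT = "வணக்கம், இது ஒரு சோதனை வாக்கியம்."  # sample Tamil: "Hello, this is a test sentence."

for model_id in MODELS:
    tok = AutoTokenizer.from_pretrained(model_id)
    n = len(tok.encode(TEXT, add_special_tokens=False))
    print(f"{model_id}: {n} tokens for {len(TEXT)} characters")
```

A few minutes of this kind of testing up front can meaningfully change your cost and context-window math before you commit to a model.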
💡 Summary
Tokenization is not just a technical step—it's a socio-economic one. As LLMs become global infrastructure, the efficiency of their vocabularies across different scripts becomes a critical factor in accessibility and fairness.
Next, we'll explore a mathematical law that governs all human language: Zipf's Law!