Mar 2025 · 10 min read

Is AI more expensive for non-English speakers? We compare tokenization efficiency across Spanish, Arabic, Chinese, and more.

Tokenization Across Languages: The Fairness Gap 🌎

Driptanil Datta · Software Developer


Most popular LLMs are trained primarily on English data. This creates a hidden disparity: the same concept might cost 1 token in English but 10 tokens in another language. This is often called the "Token Tax."

🌍 References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Multi-Language Experiment

We translated the same sentence about "blue towels" into several languages and measured the token count with both the BERT and GPT-4 tokenizers.

Language   Chars   BERT Tokens   GPT-4 Tokens
English    123     26            26
Spanish    132     46            34
Arabic     115     95            84
Chinese    39      39            55
Tamil      154     35            209
[Plot: token counts by language]
The "Token Tax" in action: non-Roman scripts like Tamil and Arabic require significantly more tokens than English for the same meaning.
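
If you want to reproduce a measurement like this, here is a minimal sketch using the tiktoken and transformers packages. The sentence below is a stand-in (not the exact "blue towels" sentence from the experiment), and bert-base-uncased is an assumption, since the post does not specify which BERT variant was measured.

```python
# pip install tiktoken transformers
import tiktoken
from transformers import AutoTokenizer

gpt4 = tiktoken.get_encoding("cl100k_base")                # GPT-4's tokenizer
bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed BERT variant

def measure(text: str) -> tuple[int, int, int]:
    """Return (characters, BERT tokens, GPT-4 tokens) for one string."""
    return len(text), len(bert.tokenize(text)), len(gpt4.encode(text))

# Stand-in sentence; substitute your own translations for each language.
sentences = {
    "English": "I bought three soft blue towels for the bathroom yesterday.",
    # "Spanish": "...", "Arabic": "...", "Chinese": "...", "Tamil": "...",
}

for lang, sentence in sentences.items():
    chars, bert_n, gpt4_n = measure(sentence)
    print(f"{lang}: {chars} chars, {bert_n} BERT tokens, {gpt4_n} GPT-4 tokens")
```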

2. Key Observations 🧐

The "English Bias"

In English, GPT-4 is extremely efficient (123 chars → 26 tokens). In Tamil, however, 154 characters explode into 209 tokens. This means a Tamil speaker might pay roughly 8 times more to process the same information as an English speaker!

Roman vs. Non-Roman

Languages that use the Roman alphabet (Spanish, Esperanto) benefit from English-heavy training data because many subwords (like prefixes and suffixes) are shared. Non-Roman scripts (Arabic, Tamil) are often tokenized at the character level or even the byte level, which is far less efficient.
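
You can watch this byte-level fallback happen. In the sketch below (my own example words, not from the experiment), tiktoken decodes each token ID back to its raw bytes, so the Tamil word visibly shatters into byte fragments while the English word stays compact.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "hello" vs. the Tamil greeting "வணக்கம்" (vanakkam).
for word in ["hello", "வணக்கம்"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{word!r}: {len(ids)} tokens -> {pieces}")
```

The English word maps to a whole-word token, while the Tamil word comes back as a run of raw multi-byte fragments that individually mean nothing.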

The Chinese Exception

Chinese is interesting. While it has more tokens than English for the same meaning, it has very few characters (39 chars → 55 tokens). Because each character carries a lot of meaning, the "meaning per token" is actually quite high, even if the "tokens per character" ratio looks bad.
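
To make that ratio concrete, here is the arithmetic on the GPT-4 numbers from the table above:

```python
# (characters, GPT-4 tokens) per language, from the table above.
data = {
    "English": (123, 26),
    "Spanish": (132, 34),
    "Arabic":  (115, 84),
    "Chinese": (39, 55),
    "Tamil":   (154, 209),
}

for lang, (chars, tokens) in data.items():
    print(f"{lang}: {tokens / chars:.2f} tokens per character")

# English ~0.21 vs. Chinese ~1.41: Chinese looks worse per character,
# but each Chinese character encodes far more meaning than a Latin letter.
```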


3. Why It Matters

  1. Cost: API usage is charged by token. Higher token counts = higher bills (see the worked example after this list).
  2. Context Window: Models have a fixed token limit (e.g., 128k). If your language is inefficiently tokenized, you can fit less "story" or "data" into the model's memory.
  3. Performance: Models often perform slightly worse on languages where they have to think in smaller, less meaningful fragments.
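
Here is a back-of-the-envelope illustration of the first two points, using the GPT-4 counts from the table and a hypothetical price of $10 per million input tokens (real rates vary by provider and model):

```python
PRICE_PER_TOKEN = 10 / 1_000_000   # hypothetical: $10 per 1M input tokens
CONTEXT_WINDOW = 128_000           # e.g., a 128k-token model

tokens = {"English": 26, "Tamil": 209}  # same sentence, from the table

for lang, n in tokens.items():
    print(f"{lang}: ${n * PRICE_PER_TOKEN:.6f} per sentence")

print(f"Tamil costs {tokens['Tamil'] / tokens['English']:.1f}x more")  # ~8.0x
print(f"A {CONTEXT_WINDOW:,}-token window fits "
      f"~{CONTEXT_WINDOW // tokens['English']} English sentences "
      f"vs. ~{CONTEXT_WINDOW // tokens['Tamil']} Tamil ones")
```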

Pro Tip: If you are building an app for a specific non-English market, check the tokenization efficiency of your chosen model first. Some models (like Llama-3) have significantly better multilingual vocabularies than others!
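
A quick way to run that check is with Hugging Face's AutoTokenizer, sketched below. Note that the Llama-3 repository on the Hub is gated, so you would need to accept Meta's license first; swap in whichever candidate models you are evaluating.

```python
from transformers import AutoTokenizer

# Candidate tokenizers to compare for your target market.
candidates = ["bert-base-uncased", "meta-llama/Meta-Llama-3-8B"]
sample = "வணக்கம், உலகம்!"   # "Hello, world!" in Tamil

for name in candidates:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.tokenize(sample))} tokens")
```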


💡 Summary

Tokenization is not just a technical step—it's a socio-economic one. As LLMs become global infrastructure, the efficiency of their vocabularies across different scripts becomes a critical factor in accessibility and fairness.

Next, we'll explore a mathematical law that governs all human language: Zipf's Law!
