++++
AI
Mar 2025×8 min read

Deep dive into BERT's vocabulary. We'll write a script to count how often every letter and digit appears across all 30,000+ BERT tokens.

Character Frequency in BERT 📊

Driptanil Datta
Driptanil DattaSoftware Developer

Character Frequency in BERT 📊

How "English-centric" is BERT? One way to find out is to look at the distribution of characters within its vocabulary. We'll iterate through every single token in the bert-base-uncased model and count the occurrences of letters and digits.

🌍
References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. The Goal

We want to know which characters are the most "popular" inside BERT's brain. This helps us understand if the model is biased toward certain sounds, prefixes, or suffixes.


2. Implementation

We'll use the transformers library to load the vocabulary and numpy to aggregate the results.

from transformers import BertTokenizer
import numpy as np
import string
import matplotlib.pyplot as plt
 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 
# Characters we want to count (a-z and 0-9)
chars = string.ascii_lowercase + string.digits
counts = np.zeros(len(chars), dtype=int)
 
# Extract all unique tokens
vocab = tokenizer.get_vocab().keys()
 
# Count occurrences
for i, c in enumerate(chars):
  # Skip [unusedX] tokens which are placeholders
  counts[i] = sum([c in tok for tok in vocab if 'unused' not in tok])

3. Visualizing the Results 📈

plt.figure(figsize=(12, 4))
plt.bar(range(len(chars)), counts, color='gray', edgecolor='black')
plt.xticks(range(len(chars)), list(chars))
plt.xlabel('Character')
plt.ylabel('Number of Tokens Containing Character')
plt.title('Character Frequency in BERT Vocabulary')
plt.show()
PLOT
PLOT
The distribution confirms that common English vowels like 'e', 'a', and 'i' dominate the vocabulary, appearing in nearly 50% of all tokens.

4. Top 5 Most Frequent Characters

CharacterToken Count
e14,633
a12,381
i11,614
r10,991
n10,735

💡 What We Learned

  1. Vowel Dominance: The character 'e' appears in nearly half of BERT's 30,522 tokens. This isn't surprising given its frequency in English.
  2. Number Scarcity: Digits (0-9) appear far less frequently than letters, explaining why BERT (and LLMs in general) can struggle with precise numerical reasoning.
  3. Vocabulary Design: The vocabulary isn't just a list of words; it's a carefully balanced set of subword fragments designed to cover as much of the language as possible with as few "pieces" as necessary.

Pro Tip: Notice how 'z', 'q', and 'j' are at the bottom of the list. If you're building a domain-specific model (like for chemistry or genetics), you might need a custom tokenizer that prioritizes the characters relevant to your data!


💡 Summary

By analyzing the "atoms" (characters) of the "molecules" (tokens), we gain a deeper intuition for why models behave the way they do. Next, we'll look at Token Translation—the process of converting these IDs between different models!

Drip

Driptanil Datta

Software Developer

Building full-stack systems, one commit at a time. This blog is a centralized learning archive for developers.

Legal Notes
Disclaimer

The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP

Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.

System Operational

© 2026 Driptanil Datta. All rights reserved.