9. CodeChallenge: BERT Character Counts 🧩
References & Disclaimer
This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
import numpy as np
import string
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

# load BERT tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Execution Output
/Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Exercise 1: Character counts in BERT tokens
# set of digits and letters
digitsLetters = string.digits + string.ascii_lowercase
# initialize results vector
charCount = np.zeros(len(digitsLetters),dtype=int)
# count the tokens that contain each character (excluding the "unused" placeholder tokens)
for i,c in enumerate(digitsLetters):
    charCount[i] = np.sum([ c in tok for tok in tokenizer.vocab.keys() if 'unused' not in tok ])

# and plot
plt.figure(figsize=(12,3))
plt.bar(range(len(charCount)),charCount,color=[.7,.7,.7],edgecolor='k')
plt.gca().set(xticks=range(len(charCount)),xticklabels=list(digitsLetters),
              xlim=[-.6,len(charCount)-.4],xlabel='Character',ylabel='Count',
              title='Frequency of characters in BERT tokens')
plt.show()
Exercise 2: Report the sorted characters
# sort character indices from most to least frequent
charOrder = np.argsort(charCount)[::-1]
for i in charOrder:
    print(f'"{digitsLetters[i]}" appears in {charCount[i]:6,} tokens.')
Execution Output
"e" appears in 14,633 tokens.
"a" appears in 12,381 tokens.
"i" appears in 11,614 tokens.
"r" appears in 10,991 tokens.
"n" appears in 10,735 tokens.