7. Exploring GPT-4's Tokenizer
Mar 2025 · 10 min read

To scale beyond simple word-splitting, modern LLMs like GPT-4 use sophisticated subword tokenization. We will use OpenAI's `tiktoken` library to dissect how these models perceive and process language.


References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

While we previously built a simple word-level tokenizer, tiktoken (used by GPT-4) uses Byte-Pair Encoding (BPE) to handle the complexities of real-world text. Our exploration covers:

  1. Tiktoken Setup: Initializing the cl100k_base encoding.
  2. Vocabulary Depth: Exploring a 100,277-token universe.
  3. Tokenization Logic: Seeing how punctuation and subwords are handled.
  4. Statistical Insights: Visualizing how token lengths vary across the entire vocabulary.
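
Before diving into tiktoken itself, here is a minimal, self-contained sketch of the BPE merge idea on plain characters. This is purely illustrative: tiktoken's real implementation works on UTF-8 bytes with precomputed merge ranks, not the greedy training loop shown here.

from collections import Counter

def merge_pair(tokens, pair):
  # replace every adjacent occurrence of `pair` with a single merged token
  out, i = [], 0
  while i < len(tokens):
    if i < len(tokens)-1 and (tokens[i], tokens[i+1]) == pair:
      out.append(tokens[i] + tokens[i+1])
      i += 2
    else:
      out.append(tokens[i])
      i += 1
  return out

# start from single characters and greedily merge the most frequent adjacent pair
toy = list('low lower lowest')
for _ in range(5):
  pair = Counter(zip(toy, toy[1:])).most_common(1)[0][0]
  toy = merge_pair(toy, pair)
  print(pair, '->', toy)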

1. Setup & Imports

To use GPT-4's tokenizer, we need the tiktoken library. We'll also import numpy and matplotlib for analysis and visualization.

import numpy as np
import matplotlib.pyplot as plt
 
# matplotlib defaults
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
# need to install the tiktoken library to get OpenAI's tokenizer
# note: it's tik-token, not tiktok-en :P
!pip install tiktoken
import tiktoken
OUTPUT (truncated)
Requirement already satisfied: tiktoken in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (0.9.0)
Requirement already satisfied: regex>=2022.1.18 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2024.11.6)
Requirement already satisfied: requests>=2.26.0 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.10)

2. Loading the GPT-4 Encoding

OpenAI provides several encodings. For GPT-4 (and GPT-3.5), the standard is cl100k_base. Let's initialize it and see what's inside.

# GPT-4's tokenizer
tokenizer = tiktoken.get_encoding('cl100k_base')
dir(tokenizer)
OUTPUT (truncated)
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
# get help
tokenizer??
OUTPUT (truncated)
Type:           Encoding
String form:    <Encoding 'cl100k_base'>
File:           ~/.pyenv/versions/3.12.6/lib/python3.12/site-packages/tiktoken/core.py
Source:        
class Encoding:
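
A convenient alternative, if you know the model rather than the encoding name: tiktoken can resolve the encoding for you. The snippet below is a small aside using tiktoken's lookup helpers.

# resolve the encoding from a model name instead of hard-coding 'cl100k_base'
enc = tiktoken.encoding_for_model('gpt-4')
print(enc.name)  # should report 'cl100k_base'

# list every encoding name this tiktoken installation knows about
print(tiktoken.list_encoding_names())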

3. Exploring Vocabulary Size

One of the reasons GPT-4 is so capable is its massive vocabulary. Unlike our simple word-level tokenizer, tiktoken manages over 100,000 unique tokens.

# vocab size
tokenizer.n_vocab
OUTPUT
100277
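
For a sense of scale, we can compare this against the older GPT-2 encoder, which is roughly half the size (this assumes the 'gpt2' encoding is bundled with your tiktoken install, as it is in recent versions):

# compare vocabulary sizes: GPT-2's encoder (~50k) vs. GPT-4's cl100k_base (~100k)
for name in ['gpt2', 'cl100k_base']:
  print(name, tiktoken.get_encoding(name).n_vocab)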

4. Special Tokens

BPE tokenizers use "special" tokens for specific purposes, like marking the end of a text string (<|endoftext|>).

tokenizer.decode([tokenizer.eot_token])
OUTPUT
'<|endoftext|>'
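
cl100k_base defines a few special tokens beyond <|endoftext|>. By default, encode() refuses to tokenize text that contains a special-token string, so you must opt in explicitly; here is a small sketch:

# all special-token strings defined for this encoding
print(tokenizer.special_tokens_set)

# encode() raises on special-token text unless it is explicitly allowed
print(tokenizer.encode('<|endoftext|>', allowed_special={'<|endoftext|>'}))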
# but not all tokens are valid, e.g.,
print(tokenizer.n_vocab)
tokenizer.decode([100277])

# list of all tokens:
# https://github.com/vnglst/gpt4-tokens/blob/main/decode-tokens.ipynb
OUTPUT
100277
OUTPUT (truncated)
KeyError                                  Traceback (most recent call last)
Cell In[9], line 3
      1 # but not all tokens are valid, e.g.,
KeyError: 'Invalid token for decoding: 100277'


5. Exploring Individual Tokens

Let's look at what 50 tokens, starting at index 1000, actually represent. Notice how many of them are pieces of words or common fragments like "ception" or "include".

for i in range(1000,1050):
  print(f'{i} = {tokenizer.decode([i])}')
OUTPUT (truncated)
1000 = indow
1001 = lement
1002 = pect
1003 = ash
1004 = [i


6. Tokenization in Practice

Now, let's see how a full sentence is broken down. We'll encode a string and then inspect how each "word" is actually composed of one or more tokens.

text = "My name is Mike and I like toothpaste-flavored chocolate."
tokens = tokenizer.encode(text)
print(tokens)
OUTPUT
[5159, 836, 374, 11519, 323, 358, 1093, 26588, 57968, 12556, 76486, 18414, 13]
text.split()
OUTPUT (truncated)
['My',
 'name',
 'is',
 'Mike',
 'and',
for word in text.split():
  print(f'"{word}" comprises token(s) {tokenizer.encode(word)}')
OUTPUT (truncated)
"My" comprises token(s) [5159]
"name" comprises token(s) [609]
"is" comprises token(s) [285]
"Mike" comprises token(s) [35541]
"and" comprises token(s) [438]
for t in tokens:
  print(f'Token {t:>6} is "{tokenizer.decode([t])}"')
OUTPUT (truncated)
Token   5159 is "My"
Token    836 is " name"
Token    374 is " is"
Token  11519 is " Mike"
Token    323 is " and"
# with special (non-ASCII) characters
tokenizer.encode('â')
OUTPUT
[9011]
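
Because this BPE operates on UTF-8 bytes, any Unicode character can be tokenized; less common characters may simply split into several byte-level tokens. A quick experiment (the exact IDs depend on the character and tiktoken version):

# one character may map to one token or to several byte-level tokens
for ch in ['â', 'é', '🚀']:
  ids = tokenizer.encode(ch)
  print(f'{ch!r} -> {ids} ({len(ids)} token(s))')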

7. Token Length Distribution

To understand the tokenizer's complexity, we can visualize the distribution of token lengths. Most tokens are between 3 and 6 characters long, which is the "sweet spot" for common subword units.

# initialize lengths vector
token_lengths = np.zeros(tokenizer.n_vocab)

# get the number of characters in each token
for idx in range(tokenizer.n_vocab):
  try:
    token_lengths[idx] = len(tokenizer.decode([idx]))
  except KeyError:
    # some IDs in the range are undefined/reserved and cannot be decoded
    token_lengths[idx] = np.nan

# count unique lengths (ignoring the undefined entries)
uniqueLengths,tokenCount = np.unique(token_lengths[~np.isnan(token_lengths)],return_counts=True)

# visualize
_,axs = plt.subplots(1,2,figsize=(12,4))
axs[0].plot(token_lengths,'k.',markersize=3,alpha=.4)
axs[0].set(xlim=[0,tokenizer.n_vocab],xlabel='Token index',ylabel='Token length (characters)',
           title='GPT4 token lengths')

axs[1].bar(uniqueLengths,tokenCount,color='k',edgecolor='gray')
axs[1].set(xlim=[0,max(uniqueLengths)],xlabel='Token length (chars)',ylabel='Token count (log scale)',
           yscale='log',title='Distribution of token lengths')

plt.tight_layout()
plt.show()

OUTPUT
[Figure: left panel "GPT4 token lengths" (token length vs. token index); right panel "Distribution of token lengths" (token count per length)]
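
As a small follow-up (reusing the token_lengths array computed above), we can also pull out the single longest token in the vocabulary:

# index and text of the longest token, ignoring the undefined (NaN) entries
longest_idx = int(np.nanargmax(token_lengths))
print(longest_idx, repr(tokenizer.decode([longest_idx])))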

Many word-tokens start with spaces

8. The Power of Leading Spaces

In BPE, a space prefix is often treated as part of the token itself. This is why " Michael" and "Michael" result in different token IDs.

# single-token words with vs. without spaces
print( tokenizer.encode(' Michael') )
print( tokenizer.encode('Michael') )
OUTPUT
[8096]
[26597]
# a word that is a single token with a leading space, but multiple tokens without
print( tokenizer.encode(' Peach') )
print( tokenizer.encode('Peach') )
OUTPUT
[64695]
[47, 9739]
peach = tokenizer.encode('Peach')
[tokenizer.decode([p]) for p in peach]
OUTPUT
['P', 'each']
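
This also resolves a quirk from the sentence example earlier: "name" encoded to [609] when passed on its own, but appeared as token 836 (" name") inside the full sentence, because the in-sentence version carries its leading space.

# the same word with and without its leading space has different token IDs
print(tokenizer.encode('name'))    # [609], as in the word-by-word loop above
print(tokenizer.encode(' name'))   # expected to match token 836 from the full sentence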

9. Scaling to a Full Book

Finally, let's see how the tokenizer performs on a large corpus. We'll download "The Time Machine" from Project Gutenberg and encode the entire text.

import requests
import re
text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text
 
# split by punctuation
words = re.split(r'([,.:;—?_!"“()\']|--|\s)',text)
words = [item.strip() for item in words if item.strip()]
print(f'There are {len(words)} words.')
words[10000:10050]
OUTPUT
There are 37786 words.
OUTPUT (truncated)
['I',
 'was',
 'not',
 'loath',
 'to',
# tokens of a random word in the text
someRandomWord = np.random.choice(words)
print(f'"{someRandomWord}" has token {tokenizer.encode(someRandomWord)}')
OUTPUT
"has" has token [4752]
for t in words[:20]:
  print(f'"{t}" has {len(tokenizer.encode(t))} tokens')
OUTPUT (truncated)
"***" has 1 tokens
"START" has 1 tokens
"OF" has 1 tokens
"THE" has 1 tokens
"PROJECT" has 1 tokens
for spelling in ['book','Book','bOok']:
  print(f'"{spelling}" has tokens {tokenizer.encode(spelling)}')
OUTPUT
"book" has tokens [2239]
"Book" has tokens [7280]
"bOok" has tokens [65, 46, 564]

But do we need to separate the text into words?

# what happens if we just tokenize the raw (unprocessed) text?
tmTokens = tokenizer.encode(text)
print(f'The text has {len(tmTokens):,} tokens and {len(words):,} words.')
OUTPUT
The text has 43,053 tokens and 37,786 words.
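No, we don't strictly need the word split: tokenizing the raw text directly works fine, and it yields only slightly more tokens than our punctuation-split word count, about 1.14 tokens per "word" here. That ratio is a handy rule of thumb when estimating how much of a context window a document will consume.

# rough tokens-per-word ratio for this text: 43,053 / 37,786 ≈ 1.14
print(f'{len(tmTokens)/len(words):.2f} tokens per word')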
# check out some tokens
 
for t in tmTokens[9990:10020]:
  print(f'Token {t:>6}: "{tokenizer.decode([t])}"')
OUTPUT (truncated)
Token    264: " a"
Token   3094: " step"
Token   4741: " forward"
Token     11: ","
Token  20365: " hes"
print(tokenizer.decode(tmTokens[9990:10020]))
OUTPUT
 a step forward, hesitated, and then touched my
hand. Then I felt other soft little tentacles upon my back and
shoulders.
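
As a final sanity check (a minimal sketch, assuming the downloaded text contains no special-token strings), decoding the full token sequence should reproduce the original text exactly; BPE over UTF-8 bytes is lossless.

# encode -> decode should round-trip the raw text exactly
print(tokenizer.decode(tmTokens) == text)

That losslessness is what lets a model's generated tokens be mapped straight back into readable text.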