7. Exploring GPT-4's Tokenizer
Mar 2025 · 10 min read

To scale beyond simple word-splitting, modern LLMs like GPT-4 use sophisticated subword tokenization. We will use OpenAI's `tiktoken` library to dissect how these models perceive and process language.


References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

While we previously built a simple word-level tokenizer, tiktoken (used by GPT-4) uses Byte-Pair Encoding (BPE) to handle the complexities of real-world text. Our exploration covers:

  1. Tiktoken Setup: Initializing the cl100k_base encoding.
  2. Vocabulary Depth: Exploring a 100,277-token universe.
  3. Tokenization Logic: Seeing how punctuation and subwords are handled.
  4. Statistical Insights: Visualizing how token lengths vary across the entire vocabulary.
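
Before diving into tiktoken itself, here is a minimal, self-contained sketch of the BPE merge idea on plain characters. This is purely illustrative: tiktoken's real implementation works on UTF-8 bytes with precomputed merge ranks, not the greedy training loop shown here.

from collections import Counter

def merge_pair(tokens, pair):
  # replace every adjacent occurrence of `pair` with a single merged token
  out, i = [], 0
  while i < len(tokens):
    if i < len(tokens)-1 and (tokens[i], tokens[i+1]) == pair:
      out.append(tokens[i] + tokens[i+1])
      i += 2
    else:
      out.append(tokens[i])
      i += 1
  return out

# start from single characters and greedily merge the most frequent adjacent pair
toy = list('low lower lowest')
for _ in range(5):
  pair = Counter(zip(toy, toy[1:])).most_common(1)[0][0]
  toy = merge_pair(toy, pair)
  print(pair, '->', toy)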

1. Setup & Imports

To use GPT-4's tokenizer, we need the tiktoken library. We'll also import numpy and matplotlib for analysis and visualization.

import numpy as np
import matplotlib.pyplot as plt
 
# matplotlib defaults
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
# need to install the tiktoken library to get OpenAI's tokenizer
# note: it's tik-token, not tiktok-en :P
!pip install tiktoken
import tiktoken
OUTPUT (truncated)
Requirement already satisfied: tiktoken in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (0.9.0)
Requirement already satisfied: regex>=2022.1.18 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2024.11.6)
Requirement already satisfied: requests>=2.26.0 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.10)

2. Loading the GPT-4 Encoding

OpenAI provides several encodings. For GPT-4 (and GPT-3.5), the standard is cl100k_base. Let's initialize it and see what's inside.

# GPT-4's tokenizer
tokenizer = tiktoken.get_encoding('cl100k_base')
dir(tokenizer)
OUTPUT (truncated)
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
# get help
tokenizer??
OUTPUT (truncated)
Type:           Encoding
String form:    <Encoding 'cl100k_base'>
File:           ~/.pyenv/versions/3.12.6/lib/python3.12/site-packages/tiktoken/core.py
Source:        
class Encoding:
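
A convenient alternative, if you know the model rather than the encoding name: tiktoken can resolve the encoding for you. The snippet below is a small aside using tiktoken's lookup helpers.

# resolve the encoding from a model name instead of hard-coding 'cl100k_base'
enc = tiktoken.encoding_for_model('gpt-4')
print(enc.name)  # should report 'cl100k_base'

# list every encoding name this tiktoken installation knows about
print(tiktoken.list_encoding_names())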

3. Exploring Vocabulary Size

One of the reasons GPT-4 is so capable is its massive vocabulary. Unlike our simple word-level tokenizer, tiktoken manages over 100,000 unique tokens.

# vocab size
tokenizer.n_vocab
OUTPUT
100277
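
For a sense of scale, we can compare this against the older GPT-2 encoder, which is roughly half the size (this assumes the 'gpt2' encoding is bundled with your tiktoken install, as it is in recent versions):

# compare vocabulary sizes: GPT-2's encoder (~50k) vs. GPT-4's cl100k_base (~100k)
for name in ['gpt2', 'cl100k_base']:
  print(name, tiktoken.get_encoding(name).n_vocab)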

4. Special Tokens

BPE tokenizers use "special" tokens for specific purposes, like marking the end of a text string (<|endoftext|>).

tokenizer.decode([tokenizer.eot_token])
OUTPUT
'<|endoftext|>'
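
cl100k_base defines a few special tokens beyond <|endoftext|>. By default, encode() refuses to tokenize text that contains a special-token string, so you must opt in explicitly; here is a small sketch:

# all special-token strings defined for this encoding
print(tokenizer.special_tokens_set)

# encode() raises on special-token text unless it is explicitly allowed
print(tokenizer.encode('<|endoftext|>', allowed_special={'<|endoftext|>'}))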
# but not all tokens are valid, e.g.,
print(tokenizer.n_vocab)
tokenizer.decode([100277])

# list of all tokens:
# https://github.com/vnglst/gpt4-tokens/blob/main/decode-tokens.ipynb
OUTPUT
100277
OUTPUT (truncated)
KeyError                                  Traceback (most recent call last)
Cell In[9], line 3
      1 # but not all tokens are valid, e.g.,
KeyError: 'Invalid token for decoding: 100277'


5. Exploring Individual Tokens

Let's look at what 50 tokens, starting at index 1000, actually represent. Notice how many of them are pieces of words or common fragments like "ception" or "include".

for i in range(1000,1050):
  print(f'{i} = {tokenizer.decode([i])}')
OUTPUT (truncated)
1000 = indow
1001 = lement
1002 = pect
1003 = ash
1004 = [i


6. Tokenization in Practice

Now, let's see how a full sentence is broken down. We'll encode a string and then inspect how each "word" is actually composed of one or more tokens.

text = "My name is Mike and I like toothpaste-flavored chocolate."
tokens = tokenizer.encode(text)
print(tokens)
OUTPUT
[5159, 836, 374, 11519, 323, 358, 1093, 26588, 57968, 12556, 76486, 18414, 13]
text.split()
OUTPUT (truncated)
['My',
 'name',
 'is',
 'Mike',
 'and',
for word in text.split():
  print(f'"{word}" comprises token(s) {tokenizer.encode(word)}')
OUTPUT (truncated)
"My" comprises token(s) [5159]
"name" comprises token(s) [609]
"is" comprises token(s) [285]
"Mike" comprises token(s) [35541]
"and" comprises token(s) [438]
for t in tokens:
  print(f'Token {t:>6} is "{tokenizer.decode([t])}"')
OUTPUT (truncated)
Token   5159 is "My"
Token    836 is " name"
Token    374 is " is"
Token  11519 is " Mike"
Token    323 is " and"
# with special (non-ASCII) characters
tokenizer.encode('â')
OUTPUT
[9011]
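
Because this BPE operates on UTF-8 bytes, any Unicode character can be tokenized; less common characters may simply split into several byte-level tokens. A quick experiment (the exact IDs depend on the character and tiktoken version):

# one character may map to one token or to several byte-level tokens
for ch in ['â', 'é', '🚀']:
  ids = tokenizer.encode(ch)
  print(f'{ch!r} -> {ids} ({len(ids)} token(s))')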

7. Token Length Distribution

To understand the tokenizer's complexity, we can visualize the distribution of token lengths. Most tokens are between 3 and 6 characters long, which is the "sweet spot" for common subword units.

# initialize lengths vector
token_lengths = np.zeros(tokenizer.n_vocab)

# get the number of characters in each token
for idx in range(tokenizer.n_vocab):
  try:
    token_lengths[idx] = len(tokenizer.decode([idx]))
  except KeyError:
    # some IDs in the range are undefined/reserved and cannot be decoded
    token_lengths[idx] = np.nan

# count unique lengths (ignoring the undefined entries)
uniqueLengths,tokenCount = np.unique(token_lengths[~np.isnan(token_lengths)],return_counts=True)

# visualize
_,axs = plt.subplots(1,2,figsize=(12,4))
axs[0].plot(token_lengths,'k.',markersize=3,alpha=.4)
axs[0].set(xlim=[0,tokenizer.n_vocab],xlabel='Token index',ylabel='Token length (characters)',
           title='GPT4 token lengths')

axs[1].bar(uniqueLengths,tokenCount,color='k',edgecolor='gray')
axs[1].set(xlim=[0,max(uniqueLengths)],xlabel='Token length (chars)',ylabel='Token count (log scale)',
           yscale='log',title='Distribution of token lengths')

plt.tight_layout()
plt.show()

OUTPUT
[Figure: left panel "GPT4 token lengths" (token length vs. token index); right panel "Distribution of token lengths" (token count per length)]
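
As a small follow-up (reusing the token_lengths array computed above), we can also pull out the single longest token in the vocabulary:

# index and text of the longest token, ignoring the undefined (NaN) entries
longest_idx = int(np.nanargmax(token_lengths))
print(longest_idx, repr(tokenizer.decode([longest_idx])))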

Many word-tokens start with spaces

8. The Power of Leading Spaces

In BPE, a space prefix is often treated as part of the token itself. This is why " Michael" and "Michael" result in different token IDs.

# single-token words with vs. without spaces
print( tokenizer.encode(' Michael') )
print( tokenizer.encode('Michael') )
OUTPUT
[8096]
[26597]
# a word that is a single token with a leading space, but multiple tokens without
print( tokenizer.encode(' Peach') )
print( tokenizer.encode('Peach') )
OUTPUT
[64695]
[47, 9739]
peach = tokenizer.encode('Peach')
[tokenizer.decode([p]) for p in peach]
OUTPUT
['P', 'each']
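
This also resolves a quirk from the sentence example earlier: "name" encoded to [609] when passed on its own, but appeared as token 836 (" name") inside the full sentence, because the in-sentence version carries its leading space.

# the same word with and without its leading space has different token IDs
print(tokenizer.encode('name'))    # [609], as in the word-by-word loop above
print(tokenizer.encode(' name'))   # expected to match token 836 from the full sentence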

9. Scaling to a Full Book

Finally, let's see how the tokenizer performs on a large corpus. We'll download "The Time Machine" from Project Gutenberg and encode the entire text.

import requests
import re
text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text
 
# split by punctuation
words = re.split(r'([,.:;—?_!"“()\']|--|\s)',text)
words = [item.strip() for item in words if item.strip()]
print(f'There are {len(words)} words.')
words[10000:10050]
OUTPUT
There are 37786 words.
OUTPUT (truncated)
['I',
 'was',
 'not',
 'loath',
 'to',
# tokens of a random word in the text
someRandomWord = np.random.choice(words)
print(f'"{someRandomWord}" has token {tokenizer.encode(someRandomWord)}')
OUTPUT
"has" has token [4752]
for t in words[:20]:
  print(f'"{t}" has {len(tokenizer.encode(t))} tokens')
OUTPUT (truncated)
"***" has 1 tokens
"START" has 1 tokens
"OF" has 1 tokens
"THE" has 1 tokens
"PROJECT" has 1 tokens
for spelling in ['book','Book','bOok']:
  print(f'"{spelling}" has tokens {tokenizer.encode(spelling)}')
OUTPUT
"book" has tokens [2239]
"Book" has tokens [7280]
"bOok" has tokens [65, 46, 564]

But do we need to separate the text into words?

# what happens if we just tokenize the raw (unprocessed) text?
tmTokens = tokenizer.encode(text)
print(f'The text has {len(tmTokens):,} tokens and {len(words):,} words.')
OUTPUT
The text has 43,053 tokens and 37,786 words.
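No, we don't strictly need the word split: tokenizing the raw text directly works fine, and it yields only slightly more tokens than our punctuation-split word count, about 1.14 tokens per "word" here. That ratio is a handy rule of thumb when estimating how much of a context window a document will consume.

# rough tokens-per-word ratio for this text: 43,053 / 37,786 ≈ 1.14
print(f'{len(tmTokens)/len(words):.2f} tokens per word')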
# check out some tokens
 
for t in tmTokens[9990:10020]:
  print(f'Token {t:>6}: "{tokenizer.decode([t])}"')
OUTPUT (truncated)
Token    264: " a"
Token   3094: " step"
Token   4741: " forward"
Token     11: ","
Token  20365: " hes"
print(tokenizer.decode(tmTokens[9990:10020]))
OUTPUT
 a step forward, hesitated, and then touched my
hand. Then I felt other soft little tentacles upon my back and
shoulders.
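
As a final sanity check (a minimal sketch, assuming the downloaded text contains no special-token strings), decoding the full token sequence should reproduce the original text exactly; BPE over UTF-8 bytes is lossless.

# encode -> decode should round-trip the raw text exactly
print(tokenizer.decode(tmTokens) == text)

That losslessness is what lets a model's generated tokens be mapped straight back into readable text.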