10. Translating Between Tokenizers 🌍
References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


Import two tokenizers

# GPT4's tokenizer (cl100k_base) via tiktoken
!pip install tiktoken
import tiktoken
gpt4Tokenizer = tiktoken.get_encoding('cl100k_base')
 
# BERT's tokenizer via Hugging Face transformers
from transformers import BertTokenizer
bertTokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Attempting a direct translation from GPT4 to BERT and back

# The two models use different tokenizers with different vocabularies,
# so token IDs from one are meaningless to the other; translating
# requires decoding back to text and re-tokenizing.
startingtext = 'Hello, my name is Mike and I like purple.'
 
# GPT4's tokens
gpt4Toks = gpt4Tokenizer.encode(startingtext)
 
# BERT's tokens
bertToks = bertTokenizer.encode(startingtext)
 
print(f'Starting text:\n{startingtext}')
print(f'\n\nGPT4 tokens:\n{gpt4Toks}')
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(gpt4Toks)}")
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(gpt4Toks)}")
 
print(f'\n\nBERT tokens:\n{bertToks}')
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(bertToks)}")
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(bertToks)}")
Execution Output
Starting text:
Hello, my name is Mike and I like purple.


GPT4 tokens:

The right way to translate (via text, not token IDs)

# text -> GPT4 tokens -> text -> BERT tokens
 
# 1) to GPT4 tokens
startingtext = 'Hello, my name is Mike and I like purple.'
gpt4Toks = gpt4Tokenizer.encode(startingtext)
 
# 2) back to text
gpt4ReconText = gpt4Tokenizer.decode(gpt4Toks)
 
# 3) then to BERT tokens
bertToks = bertTokenizer.encode(gpt4ReconText)
 
# 4) show the reconstruction
bertTokenizer.decode(bertToks)
Execution Output
'[CLS] hello, my name is mike and i like purple. [SEP]'
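The decode-then-re-encode recipe generalizes to any pair of tokenizers that expose `encode`/`decode`. A minimal sketch (the helper name `translate_tokens` is mine, not part of either library):

```python
def translate_tokens(ids, src_tokenizer, dst_tokenizer):
    """Translate token IDs from one tokenizer to another by decoding
    to text with the source and re-encoding with the destination.
    Lossy whenever either tokenizer normalizes the text."""
    text = src_tokenizer.decode(ids)
    return dst_tokenizer.encode(text)

# e.g., bertToks = translate_tokens(gpt4Toks, gpt4Tokenizer, bertTokenizer)
```

Note that this is a one-way street in general: if the destination tokenizer lowercases or collapses whitespace, translating back will not recover the original IDs.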

Possible annoyances and confusion in translations

# warning about sizes: the same text yields very different token counts
txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
print(f'Text contains {len(txt)} characters,')
print(f'              {len(gpt4Tokenizer.encode(txt))} GPT4 tokens, and')
print(f'              {len(bertTokenizer.encode(txt))} Bert tokens.')
Execution Output
Text contains 445 characters,
              96 GPT4 tokens, and
              160 Bert tokens.
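Those counts correspond to quite different compression ratios: roughly 445/96 ≈ 4.6 characters per GPT4 token versus 445/160 ≈ 2.8 per BERT token (BERT's count also includes the [CLS]/[SEP] specials). A trivial helper makes the arithmetic explicit; `chars_per_token` is my name for it:

```python
def chars_per_token(text_len, num_tokens):
    """Average number of characters covered by one token."""
    return text_len / num_tokens

# Figures from the run above:
print(f'GPT4: {chars_per_token(445, 96):.2f} chars/token')   # prints 4.64
print(f'BERT: {chars_per_token(445, 160):.2f} chars/token')  # prints 2.78
```

This matters in practice: a context-window or API budget measured in tokens buys different amounts of text depending on the tokenizer.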
# Another source of confusion: whitespace handling.
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'
# txt = 'start\t\t\t\t\t\t\tend'
# txt = 'start                    end'
 
bertToks = bertTokenizer.encode(txt)
gpt4Toks = gpt4Tokenizer.encode(txt)
 
print(f'Reconstruction in BERT:\n  {bertToks}\n  {bertTokenizer.decode(bertToks)}\n')
print(f'Reconstruction in GPT4:\n  {gpt4Toks}\n  {gpt4Tokenizer.decode(gpt4Toks)}')
Execution Output
Reconstruction in BERT:
  [101, 2707, 2203, 102]
  [CLS] start end [SEP]

Reconstruction in GPT4:

© 2026 Driptanil Datta. All rights reserved.

Built with Love ❤️ | Last updated: Mar 16 2026