10. Translating Between Tokenizers 🌍
References & Disclaimer
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Import two tokenizers
# GPT4
!pip install tiktoken
import tiktoken
gpt4Tokenizer = tiktoken.get_encoding('cl100k_base')
# BERT
from transformers import BertTokenizer
bertTokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Execution Output
Requirement already satisfied: tiktoken in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (0.9.0)
Requirement already satisfied: regex>=2022.1.18 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2024.11.6)
Requirement already satisfied: requests>=2.26.0 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from tiktoken) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/drippy/.pyenv/versions/3.12.6/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.10)

Attempting a direct translation from GPT4 to BERT and back
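Before running the real tokenizers, a toy sketch shows why a naive id swap can't work. The vocabularies below are made up (not the real GPT4/BERT ones): the same words get different ids under each vocabulary, so the ids are meaningless outside the tokenizer that produced them.

```python
# Toy vocabularies (illustrative only -- not the real GPT4/BERT vocabularies)
vocab_a = {'hello': 0, ',': 1, 'my': 2, 'name': 3}
vocab_b = {'name': 0, 'my': 1, 'hello': 2, ',': 3}

words = ['hello', ',', 'my', 'name']
ids_a = [vocab_a[w] for w in words]  # ids under tokenizer A
ids_b = [vocab_b[w] for w in words]  # ids under tokenizer B

print(ids_a)  # [0, 1, 2, 3]
print(ids_b)  # [2, 3, 1, 0]
```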
# The issue: the two models use different tokenizers, so token ids
# must be decoded back to text and re-tokenized for the other model
startingtext = 'Hello, my name is Mike and I like purple.'
# GPT4's tokens:
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# BERT's tokens
bertToks = bertTokenizer.encode(startingtext)
print(f'Starting text:\n{startingtext}')
print(f'\n\nGPT4 tokens:\n{gpt4Toks}')
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(gpt4Toks)}")
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(gpt4Toks)}")
print(f'\n\nBERT tokens:\n{bertToks}')
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(bertToks)}")
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(bertToks)}")

Execution Output
Starting text:
Hello, my name is Mike and I like purple.
GPT4 tokens:

The right way to translate (numbers to text)
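One reason the direct swap misbehaves, sketched with assumed numbers: cl100k_base has roughly 100k token ids while bert-base-uncased has only 30,522, so many GPT4 ids have no BERT entry at all, and the ids that do overlap map to unrelated strings. The ids below are illustrative, not a real encoding of any text.

```python
BERT_VOCAB_SIZE = 30522    # bert-base-uncased vocabulary size
GPT4_VOCAB_SIZE = 100277   # cl100k_base vocabulary size (tiktoken n_vocab)

# Illustrative ids in the cl100k range (not a real encoding of any text)
gpt4_style_ids = [9906, 11, 45572, 837, 99338]

# Ids with no entry in BERT's vocabulary cannot be decoded by BERT at all
missing_in_bert = [i for i in gpt4_style_ids if i >= BERT_VOCAB_SIZE]
print(missing_in_bert)  # [45572, 99338]
```

Hence the right way: go through the shared text representation, never through the raw ids.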
# text -> GPT4 tokens -> text -> BERT tokens
# 1) to GPT4 tokens
startingtext = 'Hello, my name is Mike and I like purple.'
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# 2) back to text
gpt4ReconText = gpt4Tokenizer.decode(gpt4Toks)
# 3) then to BERT tokens
bertToks = bertTokenizer.encode(gpt4ReconText)
# 4) show the reconstruction
bertTokenizer.decode(bertToks)

Execution Output
'[CLS] hello, my name is mike and i like purple. [SEP]'

Possible annoyances and confusion in translations
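The four-step recipe above generalizes to a small helper. A sketch, using toy tokenizer stand-ins (the real gpt4Tokenizer/bertTokenizer expose the same encode/decode surface, with the caveat that BERT also lowercases and adds [CLS]/[SEP]):

```python
class ToyTokenizer:
    """Minimal stand-in with an encode/decode surface like the real tokenizers."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inverse = {i: w for w, i in vocab.items()}
    def encode(self, text):
        return [self.vocab[w] for w in text.split()]
    def decode(self, ids):
        return ' '.join(self.inverse[i] for i in ids)

def translate_tokens(src, dst, src_ids):
    """Translate token ids between tokenizers via the shared text form."""
    return dst.encode(src.decode(src_ids))

tok_a = ToyTokenizer({'hello': 0, 'world': 1})
tok_b = ToyTokenizer({'world': 0, 'hello': 1})
print(translate_tokens(tok_a, tok_b, [0, 1]))  # [1, 0]
```

With the real tokenizers this would read translate_tokens(gpt4Tokenizer, bertTokenizer, gpt4Toks).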
# A warning about sizes: the same text yields different token counts under each tokenizer
txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
print(f'Text contains {len(txt)} characters,')
print(f' {len(gpt4Tokenizer.encode(txt))} GPT4 tokens, and')
print(f' {len(bertTokenizer.encode(txt))} Bert tokens.')

Execution Output
Text contains 445 characters,
96 GPT4 tokens, and
 160 Bert tokens.

# another source of confusion:
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'
# txt = 'start\t\t\t\t\t\t\tend'
# txt = 'start end'
bertToks = bertTokenizer.encode(txt)
gpt4Toks = gpt4Tokenizer.encode(txt)
print(f'Reconstruction in BERT:\n {bertToks}\n {bertTokenizer.decode(bertToks)}\n')
print(f'Reconstruction in GPT4:\n {gpt4Toks}\n {gpt4Tokenizer.decode(gpt4Toks)}')

Execution Output
Reconstruction in BERT:
[101, 2707, 2203, 102]
[CLS] start end [SEP]
Reconstruction in GPT4:
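The GPT4 output above is cut off, but the underlying contrast stands: tiktoken's byte-level BPE round-trips text exactly, whitespace and control characters included, whereas BERT's uncased preprocessing collapses whitespace and lowercases before tokenizing, which is why it reconstructs only 'start end'. A sketch of that normalization (illustrative, not BERT's actual implementation):

```python
import re

def bert_like_normalize(text):
    """Collapse runs of whitespace and lowercase, roughly what
    bert-base-uncased preprocessing does (a sketch, not the real code)."""
    return re.sub(r'\s+', ' ', text).strip().lower()

txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'
print(repr(bert_like_normalize(txt)))  # 'start end'
```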