3. Coding Challenge: Make a Tokenizer 🛠️

Now it's time to put what we've learned into practice. In this challenge, you will implement a complete encoding/decoding pipeline and visualize the resulting token IDs.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🎯 The Goal

  1. Encoder: Write a function that takes a string and returns a list of integer token IDs.
  2. Decoder: Write a function that takes a list of integers and returns the original string.
  3. Visualization: Use matplotlib to see the "fingerprint" of a sentence as a heat map.

📝 Part 1: Implementation

Using a sample corpus of quotes, create your vocabulary and mapping dictionaries.

import re
 
text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]
 
# Create vocab: lowercase, split on whitespace, deduplicate, sort
all_words = re.split(r'\s+', ' '.join(text).lower())
vocab = sorted(set(all_words))
 
# Create maps in both directions: word -> ID and ID -> word
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}

The Functions

def encoder(input_text):
    # Lowercase, split into words, and map each word to its ID
    return [word2idx[w] for w in input_text.lower().split()]
 
def decoder(indices):
    # Map each ID back to its word and rejoin with spaces
    return ' '.join(idx2word[i] for i in indices)
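A quick round trip is a good sanity check: encoding a sentence and decoding the result should reproduce the input exactly. This sketch repeats the setup from above so it runs on its own:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

# Rebuild the vocab and maps so this snippet is self-contained
vocab = sorted(set(re.split(r'\s+', ' '.join(text).lower())))
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}

def encoder(input_text):
    return [word2idx[w] for w in input_text.lower().split()]

def decoder(indices):
    return ' '.join(idx2word[i] for i in indices)

sample = 'to be or not to be'
ids = encoder(sample)
print(ids)           # six IDs; repeated words get repeated IDs
print(decoder(ids))  # 'to be or not to be'
```

Note that the round trip is only lossless because our corpus is already lowercase and space-separated; real tokenizers must also decide how to restore casing and punctuation.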

🎨 Part 2: Visualization

Once text is converted to numbers, we can visualize it. This helps in understanding how much overlap there is between sentences and how sparse the data is.

import matplotlib.pyplot as plt
 
# A new sentence built entirely from words in our vocab
new_text = "we already are the result of what everyone else already thought"
token_ids = encoder(new_text)
 
# Visualize!
plt.figure(figsize=(10, 2))
plt.imshow([token_ids], aspect='auto', cmap='viridis')
plt.colorbar(label='Token ID')
plt.title(f'Token Sequence for: "{new_text}"')
plt.xlabel('Token Position')
plt.yticks([])  # A single row, so hide the Y-axis
plt.show()

Visualization Result

Tokenizer Visualization

What are we looking at? Each block in the heat map represents a word. The color corresponds to the word's index in our vocabulary. This is the first step toward understanding how models "see" sentences as mathematical patterns.
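The same fingerprint can be inspected without matplotlib. Printing each position's ID makes the overlap concrete: repeated words (like "already" here) map to the same ID, which is exactly why matching colors recur in the heat map. A self-contained sketch:

```python
import re

# Rebuild the vocab from the corpus so this snippet runs on its own
corpus = ' '.join([
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]).lower()
vocab = sorted(set(re.split(r'\s+', corpus)))
word2idx = {w: i for i, w in enumerate(vocab)}

new_text = 'we already are the result of what everyone else already thought'
for pos, word in enumerate(new_text.split()):
    print(f'position {pos:2d} -> ID {word2idx[word]:2d} ({word})')
```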


🧠 Reflection

Try encoding a sentence with a word that isn't in your vocabulary. What happens?

  • Error? Yes: encoder raises a KeyError, because word2idx has no entry for the unseen word.
  • Solution? In real systems, we add a special <UNK> (Unknown) token to handle words that weren't seen during training.
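One minimal sketch of that idea (the <UNK> convention below is an assumption for illustration, not part of the challenge code): reserve one extra ID beyond the vocabulary and fall back to it for any out-of-vocabulary word.

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

vocab = sorted(set(re.split(r'\s+', ' '.join(text).lower())))
word2idx = {word: i for i, word in enumerate(vocab)}

# Assumed convention: <UNK> takes the next free ID after the known vocab
UNK_ID = len(vocab)

def encoder_unk(input_text):
    # dict.get falls back to UNK_ID for any word not seen during "training"
    return [word2idx.get(w, UNK_ID) for w in input_text.lower().split()]

print(encoder_unk('to be or not to dance'))  # the last ID is UNK_ID
```

This never crashes, but every unknown word collapses to the same ID, so information is lost; that trade-off is what motivates the subword tokenization covered next.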

In the next module, we'll dive into Subword Tokenization to solve the vocabulary explosion and unknown word problems permanently.

© 2026 Driptanil Datta. All rights reserved.

Software Developer & Engineer

Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


Built with Love ❤️ | Last updated: Mar 16 2026