AI · Mar 2025 · 15 min read

Build a basic tokenizer from scratch, implement encoder/decoder functions, and visualize how text is transformed into numerical vectors.

Make a Tokenizer 🛠️

Driptanil Datta · Software Developer

Now that we've seen the basic concepts of word-to-index mapping, it's time to put them into practice. We'll build a full tokenization pipeline for a small corpus and visualize the results.

🌍 References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Mission

Your task is to:

  1. Parse a multi-sentence corpus into individual words.
  2. Build a unique vocabulary (lexicon).
  3. Implement robust encoder and decoder functions.
  4. Visualize the resulting tokens and their one-hot representations.

1. Setup & Data

We'll start with three simple sentences. Our goal is to treat them as a single corpus.

import re
import numpy as np
import matplotlib.pyplot as plt
 
# list of sentences
text = [ 'All that we are is the result of what we have thought',
         'To be or not to be that is the question',
         'Be yourself everyone else is already taken' ]
 
# create a vocabulary of unique words (split on any whitespace)
allwords = re.split(r'\s+', ' '.join(text).lower())
vocab = sorted(set(allwords))

# create encoder and decoder dictionaries
word2idx = { word:i for i,word in enumerate(vocab) }
idx2word = { i:word for i,word in enumerate(vocab) }
 
print(f"Vocabulary Size: {len(vocab)}")
print(f"Sample mapping: 'to' -> {word2idx['to']}")
OUTPUT
Vocabulary Size: 21
Sample mapping: 'to' -> 17
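
A quick sanity check: the dictionary maps each sorted vocabulary word to its index. This snippet is just for inspection and is not part of the original pipeline.

# peek at the first few entries of the mapping
for word in vocab[:5]:
  print(f"{word2idx[word]:>2} -> {word}")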

2. Encoder & Decoder Functions

Instead of manually looking up words, we need functions that can handle entire strings or lists of IDs.

### the encoder function
def encoder(text):
  # parse the text into words (split on any whitespace)
  words = re.split(r'\s+', text.lower())
  # look up the index of each word
  return [ word2idx[w] for w in words ]
 
### now for the decoder
def decoder(indices):
  # find the words for these indices, and join into one string
  return ' '.join([ idx2word[i] for i in indices ])
 
# Test the pipeline
newtext = 'we already are the result of what everyone else already thought'
newtext_tokenIDs = encoder(newtext)
decoded_text = decoder(newtext_tokenIDs)
 
print(f'Token IDs: {newtext_tokenIDs}')
print(f'Decoded:   {decoded_text}')
OUTPUT
Token IDs: [18, 1, 2, 15, 12, 9, 19, 5, 4, 1, 16]
Decoded:   we already are the result of what everyone else already thought
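
One caveat: encoder raises a KeyError for any word outside our 21-word vocabulary. A common remedy is to reserve an "unknown" token; here is a minimal sketch of that idea (the UNK_ID and safe_encoder names are illustrative, not part of the lesson's code).

# reserve one extra ID for out-of-vocabulary words
UNK_ID = len(vocab)

def safe_encoder(text):
  words = re.split(r'\s+', text.lower())
  # fall back to UNK_ID for words missing from word2idx
  return [ word2idx.get(w, UNK_ID) for w in words ]

print(safe_encoder('to be or not to sleep'))  # 'sleep' gets the UNK ID

Real tokenizers typically solve this with subword units instead, but an unknown token is the classic word-level fix.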

3. Visualizing Tokens

Language models don't "see" text; they see sequences of integers. We can visualize this sequence to understand the density and repetition of our vocabulary.

# get all the text and all the tokens
alltext = ' '.join(text)
tokens = encoder(alltext)
 
# plot the tokens
_, ax = plt.subplots(1, figsize=(12, 5))
ax.plot(tokens, 'ks', markersize=12, markerfacecolor=[.7, .7, .9])
ax.set(xlabel='Word index', yticks=range(len(vocab)))
ax.grid(linestyle='--', axis='y')
 
# invisible axis for right-hand-side labels
ax2 = ax.twinx()
ax2.plot(tokens, alpha=0)
ax2.set(yticks=range(len(vocab)), yticklabels=vocab)
 
plt.show()
PLOT
A plot of token IDs over time, showing the 'shape' of our sentences.
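
To quantify the repetition visible in the plot, we can count how often each token ID appears. A small sketch using collections.Counter (an addition for illustration, not part of the original code):

from collections import Counter

# count occurrences of each token ID across the corpus
counts = Counter(tokens)
for idx, n in counts.most_common(5):
  print(f"'{idx2word[idx]}' (ID {idx}): {n} occurrences")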

4. One-Hot Encoding 🧮

While integers are better than strings, neural networks often prefer One-Hot Encoding for categorical data. This transforms each token into a vector of zeros with a single 1 at the token's index.

word_matrix = np.zeros((len(allwords), len(vocab)), dtype=int)
 
# create the matrix
for i, word in enumerate(allwords):
  word_matrix[i, word2idx[word]] = 1
 
print(f'One-hot encoding matrix size: {word_matrix.shape}')
print(word_matrix[:5, :5]) # Show a small slice
OUTPUT
One-hot encoding matrix size: (29, 21)
[[1 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 1 0 0]
 [0 0 0 0 0]]
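
As an aside, the same matrix can be built in one vectorized step with np.eye, and decoded losslessly with argmax. A minimal sketch of that equivalence (assuming the variables defined above):

# row i of the identity matrix is the one-hot vector for token ID i
token_ids = [ word2idx[w] for w in allwords ]
word_matrix_fast = np.eye(len(vocab), dtype=int)[token_ids]
print(np.array_equal(word_matrix, word_matrix_fast))  # True

# argmax recovers each token ID, so the round trip is exact
print(decoder(word_matrix.argmax(axis=1)) == alltext.lower())  # True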

💡 Key Takeaway

By the end of this challenge, we've transformed raw human language into a structured numerical matrix. This matrix is the bridge between the world of semantics and the world of linear algebra.
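
To make that bridge concrete: multiplying the one-hot matrix by any weight matrix simply selects rows, which is exactly what an embedding lookup does. A hedged sketch (the 4-dimensional random embedding is arbitrary, purely for illustration):

# one random 4-dim vector per vocabulary word (dimension chosen arbitrarily)
embedding = np.random.randn(len(vocab), 4)

# one-hot @ embedding picks out the embedding row for each token
embedded_corpus = word_matrix @ embedding
print(embedded_corpus.shape)  # (29, 4): one vector per word in the corpus

# identical to direct row indexing
print(np.allclose(embedded_corpus, embedding[word_matrix.argmax(axis=1)]))  # True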

In the next lesson, we'll see how to prepare large-scale real-world text (like a full book) for this process!
