🚀
4. Tokenizing "The Time Machine"

Coding Challenge: Tokenizing "The Time Machine" 🕰️

In this challenge, we step up our game by processing a full-length book: H.G. Wells' "The Time Machine". This exercise covers advanced text cleaning, building a larger vocabulary, and a fun experiment: decoding a "random walk" through the book.

🌐
References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


📥 Exercise 1: Get and Prepare the Text

We'll fetch the raw text directly from Project Gutenberg and apply a series of standard pre-processing steps.

import requests
import re
import string

# Fetch the raw text from Project Gutenberg
url = 'https://www.gutenberg.org/files/35/35-0.txt'
text = requests.get(url).text

# Advanced cleaning: collapse line breaks, strip underscores, and remove
# mojibake curly quotes ('â\x80\x9c' / 'â\x80\x9d')
strings2replace = ['\r\n\r\n', '\r\n', '_', 'â\x80\x9c', 'â\x80\x9d']
for str2match in strings2replace:
  text = re.sub(re.escape(str2match), ' ', text)

# Remove remaining non-ASCII characters, drop digits, and lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
text = re.sub(r'\d+', '', text).lower()
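As a quick sanity check, we can run the same cleaning steps on a short hard-coded excerpt (a made-up sample standing in for the downloaded book) and confirm that digits and special characters are gone:

```python
import re

# Hypothetical excerpt standing in for the downloaded text
sample = 'The Time Traveller\r\nhad 3 lamps _lit_ in 1895.'

# Same cleaning steps as above
for str2match in ['\r\n\r\n', '\r\n', '_']:
    sample = re.sub(re.escape(str2match), ' ', sample)
sample = re.sub(r'[^\x00-\x7F]+', ' ', sample)
sample = re.sub(r'\d+', '', sample).lower()

print(sample)  # digits gone, underscores gone, all lowercase
```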

🧩 Exercise 2: Building the Vocabulary

With our text cleaned, we can now extract all unique "tokens" (words) and create our mapping dictionaries.

# Split into words on punctuation and whitespace
words = re.split(fr'[{string.punctuation}\s]+', text)
words = [w.strip() for w in words if len(w.strip()) > 1]

# Create the Lexicon: one integer ID per unique word
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

print(f"Total words: {len(words)}")
print(f"Unique tokens (Lexicon): {len(vocab)}")
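To see the two dictionaries in action, here is a minimal round-trip sketch on a toy sentence (standing in for the full book, so the IDs are illustrative only): words are encoded to integer IDs and then decoded back.

```python
import re
import string

# Toy corpus standing in for the cleaned book text
toy_text = 'the time machine is a machine that travels through time'

words = re.split(fr'[{string.punctuation}\s]+', toy_text)
words = [w.strip() for w in words if len(w.strip()) > 1]

vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

# Encode words -> token IDs, then decode back
token_ids = [word2idx[w] for w in words]
decoded = ' '.join(idx2word[i] for i in token_ids)

print(token_ids)
print(decoded)
```

Note that the round trip is lossless for the kept tokens, but one-character words like "a" were filtered out during splitting, so they never come back.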

🎲 Exercise 3: A Random Walk

Once the text is tokenized, we can manipulate it mathematically. One interesting way to see the variety of your vocabulary is to generate a "random walk": picking random token IDs and decoding them.

import numpy as np
 
# Pick 10 random token IDs
random_tokens = np.random.randint(0, len(vocab), 10)
 
# Decode them!
decoded_text = ' '.join([idx2word[i] for i in random_tokens])
print(f"Decoded: {decoded_text}")

Visualization

We can also visualize the density of tokens throughout the book. This helps identify repetitive patterns or unique sections.

[Figure: token density across "The Time Machine"]
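One way to compute such a density curve is to mark where a chosen token occurs and smooth the 0/1 indicator with a sliding window. The sketch below uses a synthetic token stream in place of the book's `words` list, and the window size of 100 is an arbitrary choice:

```python
import numpy as np

# Synthetic token stream standing in for the book's `words` list
rng = np.random.default_rng(0)
words = rng.choice(['time', 'machine', 'weena', 'morlock'], size=1000).tolist()

# 0/1 indicator of where the target token occurs
target = 'time'
occurrences = np.array([w == target for w in words], dtype=float)

# Sliding-window average -> local density of the token
window = 100
density = np.convolve(occurrences, np.ones(window) / window, mode='same')

print(f"Mean density of '{target}': {density.mean():.3f}")
# Plotting `density` (e.g. with matplotlib's plt.plot) shows peaks
# where the token clusters in the narrative.
```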


💡 Key Takeaway

By scaling up to a full book, we start to see the limitations of simple Word Tokenization:

  1. Vocabulary Size: The lexicon can grow extremely large.
  2. Out-of-Vocabulary (OOV): New text will almost certainly contain words we haven't seen.
  3. Efficiency: Storing every unique word as a distinct ID becomes computationally expensive as the corpus grows.
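The OOV problem is easy to demonstrate with a toy vocabulary (a minimal sketch; the `<unk>` fallback ID is our own addition, not part of the exercises above):

```python
# Toy vocabulary built from a tiny "training" text
train_words = ['the', 'time', 'machine', 'travels', 'through', 'time']
vocab = sorted(set(train_words))
word2idx = {w: i for i, w in enumerate(vocab)}

# Reserve one extra ID as an <unk> fallback for unseen words
unk_id = len(vocab)

new_sentence = ['the', 'morlocks', 'chased', 'the', 'traveller']
token_ids = [word2idx.get(w, unk_id) for w in new_sentence]

print(token_ids)
# Every word not seen during "training" collapses to the same <unk> ID,
# losing information: this is the OOV problem.
```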

In the next module, we will explore Subword Tokenization (BPE) to solve these scaling issues!

© 2026 Driptanil Datta. All rights reserved.


Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


Built with Love ❤️ | Last updated: Mar 16 2026