Coding Challenge: Tokenizing "The Time Machine" 🕰️
In this challenge, we step up our game by processing a full-length book: H.G. Wells' "The Time Machine". This exercise covers advanced text cleaning, building a larger vocabulary, and a fun experiment: decoding a "random walk" through the book.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
📥 Exercise 1: Get and Prepare the Text
We'll fetch the raw text directly from Project Gutenberg and apply a series of standard pre-processing steps.
import requests
import re
import string
# Fetch raw text
url = 'https://www.gutenberg.org/files/35/35-0.txt'
text = requests.get(url).text
# Advanced cleaning: handle special chars and non-ASCII
# Strings to strip: line breaks, underscores (italics markers), and mis-encoded curly quotes
strings2replace = ['\r\n\r\n', '\r\n', '_', 'â\x80\x9c', 'â\x80\x9d']
for str2match in strings2replace:
    text = text.replace(str2match, ' ')
# Remove numbers and convert to lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
text = re.sub(r'\d+', '', text).lower()
🧩 Exercise 2: Building the Vocabulary
With our text cleaned, we can now extract all unique "tokens" (words) and create our mapping dictionaries.
# Split into words (ignoring punctuation)
words = re.split(fr'[{re.escape(string.punctuation)}\s]+', text)
# Drop empty strings and single-character tokens
words = [w.strip() for w in words if len(w.strip()) > 1]
# Create the Lexicon
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
print(f"Total words: {len(words)}")
print(f"Unique tokens (Lexicon): {len(vocab)}")
🎲 Exercise 3: A Random Walk
Once the text is tokenized, we can manipulate it mathematically. One fun way to see the variety of the vocabulary is to generate a "random walk": picking random token IDs and decoding them back into words.
import numpy as np
# Pick 10 random token IDs
random_tokens = np.random.randint(0, len(vocab), 10)
# Decode them!
decoded_text = ' '.join([idx2word[i] for i in random_tokens])
print(f"Decoded: {decoded_text}")
Visualization
We can also visualize the density of tokens throughout the book. This helps identify repetitive patterns or unique sections.
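One simple way to visualize token density is to plot each token's ID against its position in the text: dense horizontal bands reveal frequently repeated words. The sketch below is self-contained, using a short repeated sample string as a stand-in for the cleaned book text from Exercise 1 (swap in your own `text` variable).

```python
import re
import string
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the cleaned book text (assumption: replace with `text` from Exercise 1)
sample = "the time traveller for so it will be convenient to speak of him " * 50

# Tokenize the same way as in Exercise 2
words = re.split(fr'[{re.escape(string.punctuation)}\s]+', sample)
words = [w.strip() for w in words if len(w.strip()) > 1]

# Map every word to its ID, preserving reading order
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
token_ids = np.array([word2idx[w] for w in words])

# Scatter token ID vs. position: horizontal bands mark frequent words
plt.figure(figsize=(10, 4))
plt.plot(token_ids, 'k.', markersize=2)
plt.xlabel('Position in text')
plt.ylabel('Token ID')
plt.title('Token density across the text')
plt.show()
```

On the real book, repetitive sections (the Project Gutenberg header, chapter boilerplate) stand out as distinctive banding patterns.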
💡 Key Takeaway
By scaling up to a full book, we start to see the limitations of simple Word Tokenization:
- Vocabulary Size: The lexicon can grow extremely large.
- Out-of-Vocabulary (OOV): New text will almost certainly contain words we haven't seen.
- Efficiency: Storing every unique word as a distinct ID becomes computationally expensive as the corpus grows.
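The OOV problem is easy to demonstrate: any word absent from the lexicon simply has no ID. A minimal sketch with a hypothetical four-word vocabulary:

```python
# Hypothetical tiny vocabulary built from training text
vocab = ['machine', 'the', 'time', 'traveller']
word2idx = {w: i for i, w in enumerate(vocab)}

# Look up a known word and an unseen word
for w in ['time', 'spaceship']:
    idx = word2idx.get(w)  # None signals an out-of-vocabulary word
    print(w, '->', idx)
# time -> 2
# spaceship -> None
```

A word tokenizer has no way to assign an ID to "spaceship" without retraining the vocabulary; subword schemes sidestep this by composing unseen words from smaller known pieces.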
In the next module, we will explore Subword Tokenization (BPE) to solve these scaling issues!