Coding Challenge: Tokenizing "The Time Machine" 🕰️
In this challenge, we step up our game by processing a full-length book: H.G. Wells' "The Time Machine". This exercise covers advanced text cleaning, building a larger vocabulary, and a fun experiment: decoding a "random walk" through the book.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
📥 Exercise 1: Get and Prepare the Text
We'll fetch the raw text directly from Project Gutenberg and apply a series of standard pre-processing steps.
import requests
import re
import string
# Fetch raw text
url = 'https://www.gutenberg.org/files/35/35-0.txt'
text = requests.get(url).text
# Advanced cleaning: handle special chars and non-ASCII
# Strings to strip: line breaks, underscores (italics markers), and mis-encoded curly quotes
strings2replace = ['\r\n\r\n', '\r\n', '_', 'â\x80\x9c', 'â\x80\x9d']
for str2match in strings2replace:
    text = text.replace(str2match, ' ')
# Remove numbers and convert to lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
text = re.sub(r'\d+', '', text).lower()
🧩 Exercise 2: Building the Vocabulary
With our text cleaned, we can now extract all unique "tokens" (words) and create our mapping dictionaries.
# Split into words (ignoring punctuation)
words = re.split(fr'[{re.escape(string.punctuation)}\s]+', text)
# Drop empty strings and single-character tokens
words = [w.strip() for w in words if len(w.strip()) > 1]
# Create the Lexicon
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
print(f"Total words: {len(words)}")
print(f"Unique tokens (Lexicon): {len(vocab)}")
🎲 Exercise 3: A Random Walk
Once the text is tokenized, we can manipulate it mathematically. One fun way to see the variety of the vocabulary is to generate a "random walk": picking random token IDs and decoding them back into words.
import numpy as np
# Pick 10 random token IDs
random_tokens = np.random.randint(0, len(vocab), 10)
# Decode them!
decoded_text = ' '.join([idx2word[i] for i in random_tokens])
print(f"Decoded: {decoded_text}")
Visualization
We can also visualize the density of tokens throughout the book. This helps identify repetitive patterns or unique sections.
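One simple way to visualize token density is to plot each token's ID against its position in the text: dense horizontal bands reveal frequently repeated words. The sketch below is self-contained, using a short repeated sample string as a stand-in for the cleaned book text from Exercise 1 (swap in your own `text` variable).

```python
import re
import string
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the cleaned book text (assumption: replace with `text` from Exercise 1)
sample = "the time traveller for so it will be convenient to speak of him " * 50

# Tokenize the same way as in Exercise 2
words = re.split(fr'[{re.escape(string.punctuation)}\s]+', sample)
words = [w.strip() for w in words if len(w.strip()) > 1]

# Map every word to its ID, preserving reading order
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
token_ids = np.array([word2idx[w] for w in words])

# Scatter token ID vs. position: horizontal bands mark frequent words
plt.figure(figsize=(10, 4))
plt.plot(token_ids, 'k.', markersize=2)
plt.xlabel('Position in text')
plt.ylabel('Token ID')
plt.title('Token density across the text')
plt.show()
```

On the real book, repetitive sections (the Project Gutenberg header, chapter boilerplate) stand out as distinctive banding patterns.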
💡 Key Takeaway
By scaling up to a full book, we start to see the limitations of simple Word Tokenization:
- Vocabulary Size: The lexicon can grow extremely large.
- Out-of-Vocabulary (OOV): New text will almost certainly contain words we haven't seen.
- Efficiency: Storing every unique word as a distinct ID becomes computationally expensive as the corpus grows.
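The OOV problem is easy to demonstrate: any word absent from the lexicon simply has no ID. A minimal sketch with a hypothetical four-word vocabulary:

```python
# Hypothetical tiny vocabulary built from training text
vocab = ['machine', 'the', 'time', 'traveller']
word2idx = {w: i for i, w in enumerate(vocab)}

# Look up a known word and an unseen word
for w in ['time', 'spaceship']:
    idx = word2idx.get(w)  # None signals an out-of-vocabulary word
    print(w, '->', idx)
# time -> 2
# spaceship -> None
```

A word tokenizer has no way to assign an ID to "spaceship" without retraining the vocabulary; subword schemes sidestep this by composing unseen words from smaller known pieces.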
In the next module, we will explore Subword Tokenization (BPE) to solve these scaling issues!