Preparing Text for Tokens
In real-world applications, data rarely arrives in a clean, perfectly formatted state. Before we can convert text to numbers, we must perform several pre-processing steps to standardize the input and remove noise.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Fetching Raw Data
Most LLM training pipelines start by web-scraping or fetching text from large repositories like Project Gutenberg.
import requests
# Fetching 'The Time Machine' by H. G. Wells
response = requests.get('https://www.gutenberg.org/files/35/35-0.txt')
raw_text = response.text

Cleaning Up the Text
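Before cleaning, it helps to look at what actually came back. Printing with `repr()` exposes carriage returns and other invisible artifacts; the string below is a made-up stand-in for the real download, used only to illustrate the idea:

```python
# Stand-in for the downloaded text (illustrative only)
raw_text = '\ufeffThe Time Machine\r\n\r\nAn Invention\r\n\r\nby H. G. Wells\r\n'

# repr() makes the \r\n line endings and the leading BOM visible
print(repr(raw_text[:40]))
```

Artifacts like these are exactly what the substitutions in the next section remove.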
Raw text often contains formatting artifacts, special characters, and non-ASCII encoding that can confuse a simple tokenizer.
1. Removing Special Characters
We use regular expressions to replace newlines, em dashes, underscores, and specialized curly quotes with standard spaces.
import re
# Character strings to replace
strings_to_replace = ['\r\n', '—', '_', '“', '”']
for s in strings_to_replace:
    raw_text = re.sub(s, ' ', raw_text)

2. Stripping Non-ASCII & Numbers
To keep our initial vocabulary manageable, we often remove complex characters and numerical digits.
# Remove non-ASCII characters
text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
# Remove numbers
text = re.sub(r'\d+', '', text)

3. Case Normalization
Converting everything to lowercase prevents the model from treating "Machine" and "machine" as two separate concepts.
text = text.lower()

Advanced Parsing
While splitting by whitespace is a start, we must also handle punctuation. We don't want "machine." (with a period) to be a different token than "machine".
import re
import string

text = " Hello, world! "
# Trim outer whitespace, lowercase, and drop exclamation marks
text = text.strip().lower().replace('!', '')
pattern = fr'[{string.punctuation}\s]+'
# Split and clean
words = [w.strip() for w in re.split(pattern, text) if len(w.strip()) > 1]

Robust Encoding Functions
As our workflow becomes more complex, it's helpful to wrap our encoding and decoding logic into dedicated functions.
import numpy as np
def encoder(word_list, word2idx):
    # Initialize a numerical vector
    indices = np.zeros(len(word_list), dtype=int)
    for i, word in enumerate(word_list):
        indices[i] = word2idx[word]
    return indices

def decoder(indices, idx2word):
    # Reconstruct the string
    return ' '.join([idx2word[i] for i in indices if i in idx2word])

Testing the Pipeline
# Create maps
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
# Encode-then-Decode check
original_phrase = ['the', 'time', 'machine']
encoded = encoder(original_phrase, word2idx)
decoded = decoder(encoded, idx2word)
print(encoded) # [4042 4109 2416]
print(decoded)  # 'the time machine'

By cleaning the text first, we've reduced the complexity the model has to learn, allowing it to focus on the semantic relationships between words.
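The individual steps above can be collected into one helper. This is a sketch that mirrors the substitutions used earlier; note that the em dash and curly quotes are already covered by the non-ASCII filter, so only the literal replacements remain:

```python
import re
import string

def clean_text(raw_text):
    # Literal artifacts: Windows line endings and underscores
    for s in ['\r\n', '_']:
        raw_text = raw_text.replace(s, ' ')
    # Strip non-ASCII characters (covers em dashes and curly quotes)
    text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
    # Remove digits, then normalize case
    text = re.sub(r'\d+', '', text).lower()
    # Split on punctuation and whitespace, dropping one-letter fragments
    pattern = fr'[{string.punctuation}\s]+'
    return [w for w in re.split(pattern, text) if len(w) > 1]

print(clean_text('The Time—Machine, 1895 Edition!'))  # ['the', 'time', 'machine', 'edition']
```

Wrapping the steps this way makes the cleaning order explicit and lets you apply the identical transformation to any new text before encoding it.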