Preparing Text for Tokens
In real-world applications, data rarely arrives in a clean, perfectly formatted state. Before we can convert text to numbers, we must perform several pre-processing steps to standardize the input and remove noise.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Fetching Raw Data
Most LLM training pipelines start by web-scraping or fetching text from large repositories like Project Gutenberg.
import requests
# Fetching 'The Time Machine' by H. G. Wells
response = requests.get('https://www.gutenberg.org/files/35/35-0.txt')
raw_text = response.text

Cleaning Up the Text
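Before cleaning, it helps to look at what actually came back. Printing with `repr()` exposes carriage returns and other invisible artifacts; the string below is a made-up stand-in for the real download, used only to illustrate the idea:

```python
# Stand-in for the downloaded text (illustrative only)
raw_text = '\ufeffThe Time Machine\r\n\r\nAn Invention\r\n\r\nby H. G. Wells\r\n'

# repr() makes the \r\n line endings and the leading BOM visible
print(repr(raw_text[:40]))
```

Artifacts like these are exactly what the substitutions in the next section remove.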
Raw text often contains formatting artifacts, special characters, and non-ASCII encoding that can confuse a simple tokenizer.
1. Removing Special Characters
We use regular expressions to replace newlines, em dashes, underscores, and specialized curly quotes with standard spaces.
import re
# Character strings to replace
strings_to_replace = ['\r\n', '—', '_', '“', '”']
for s in strings_to_replace:
    raw_text = re.sub(s, ' ', raw_text)

2. Stripping Non-ASCII & Numbers
To keep our initial vocabulary manageable, we often remove complex characters and numerical digits.
# Remove non-ASCII characters
text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
# Remove numbers
text = re.sub(r'\d+', '', text)

3. Case Normalization
Converting everything to lowercase prevents the model from treating "Machine" and "machine" as two separate concepts.
text = text.lower()

Advanced Parsing
While splitting by whitespace is a start, we must also handle punctuation. We don't want "machine." (with a period) to be a different token than "machine".
import re
import string

text = " Hello, world! "
# Trim outer whitespace, lowercase, and drop exclamation marks
text = text.strip().lower().replace('!', '')
pattern = fr'[{string.punctuation}\s]+'
# Split and clean
words = [w.strip() for w in re.split(pattern, text) if len(w.strip()) > 1]

Robust Encoding Functions
As our workflow becomes more complex, it's helpful to wrap our encoding and decoding logic into dedicated functions.
import numpy as np
def encoder(word_list, word2idx):
    # Initialize a numerical vector
    indices = np.zeros(len(word_list), dtype=int)
    for i, word in enumerate(word_list):
        indices[i] = word2idx[word]
    return indices

def decoder(indices, idx2word):
    # Reconstruct the string
    return ' '.join([idx2word[i] for i in indices if i in idx2word])

Testing the Pipeline
# Create maps
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
# Encode-then-Decode check
original_phrase = ['the', 'time', 'machine']
encoded = encoder(original_phrase, word2idx)
decoded = decoder(encoded, idx2word)
print(encoded) # [4042 4109 2416]
print(decoded)  # 'the time machine'

By cleaning the text first, we've reduced the complexity the model has to learn, allowing it to focus on the semantic relationships between words.
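The individual steps above can be collected into one helper. This is a sketch that mirrors the substitutions used earlier; note that the em dash and curly quotes are already covered by the non-ASCII filter, so only the literal replacements remain:

```python
import re
import string

def clean_text(raw_text):
    # Literal artifacts: Windows line endings and underscores
    for s in ['\r\n', '_']:
        raw_text = raw_text.replace(s, ' ')
    # Strip non-ASCII characters (covers em dashes and curly quotes)
    text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
    # Remove digits, then normalize case
    text = re.sub(r'\d+', '', text).lower()
    # Split on punctuation and whitespace, dropping one-letter fragments
    pattern = fr'[{string.punctuation}\s]+'
    return [w for w in re.split(pattern, text) if len(w) > 1]

print(clean_text('The Time—Machine, 1895 Edition!'))  # ['the', 'time', 'machine', 'edition']
```

Wrapping the steps this way makes the cleaning order explicit and lets you apply the identical transformation to any new text before encoding it.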