Preparing Text for Tokens 📖

In real-world applications, data rarely arrives in a clean, perfectly formatted state. Before we can convert text to numbers, we must perform several pre-processing steps to standardize the input and remove noise.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


๐ŸŒ Fetching Raw Data

Most LLM training pipelines start by web-scraping or fetching text from large repositories like Project Gutenberg.

import requests

# Fetching 'The Time Machine' by H. G. Wells
response = requests.get('https://www.gutenberg.org/files/35/35-0.txt', timeout=30)
response.raise_for_status()   # fail loudly on HTTP errors
response.encoding = 'utf-8'   # Gutenberg's -0.txt files are UTF-8; don't let requests guess
raw_text = response.text
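Gutenberg files also wrap the body text in license boilerplate. One way to trim it, assuming the usual `*** START OF` / `*** END OF` marker lines (their exact wording varies from file to file), is a small helper like this hypothetical `strip_gutenberg_boilerplate`:

```python
def strip_gutenberg_boilerplate(text):
    # Project Gutenberg texts place license boilerplate before a
    # "*** START OF ..." line and after a "*** END OF ..." line.
    start = text.find('*** START OF')
    end = text.find('*** END OF')
    if start != -1 and end != -1 and start < end:
        # Skip past the start-marker line itself
        start = text.index('\n', start) + 1
        return text[start:end]
    # If the markers are missing, fall back to the full text
    return text
```

If the markers are absent or worded differently, the helper simply returns the text unchanged, so it is safe to apply unconditionally.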

🧹 Cleaning Up the Text

Raw text often contains formatting artifacts, special characters, and non-ASCII codepoints that can confuse a simple tokenizer.

1. Removing Special Characters

We use regular expressions to replace Windows line endings, em-dashes, underscores, and curly quotes with standard spaces.

import re

# Character strings to replace
strings_to_replace = ['\r\n', '—', '_', '“', '”']

for s in strings_to_replace:
    raw_text = re.sub(s, ' ', raw_text)
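As a quick sanity check, here is the same loop run on a short made-up sample line (the sentence is illustrative, not from the book):

```python
import re

strings_to_replace = ['\r\n', '—', '_', '“', '”']

sample = '“No,” he said—firmly.\r\nThe _machine_ hummed.'
for s in strings_to_replace:
    sample = re.sub(s, ' ', sample)

# The line ending, dash, underscores, and curly quotes are now plain
# spaces; the later whitespace split absorbs any doubled-up spaces.
```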

2. Stripping Non-ASCII & Numbers

To keep our initial vocabulary manageable, we often remove complex characters and numerical digits.

# Remove non-ASCII characters
text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
 
# Remove numbers
text = re.sub(r'\d+', '', text)
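A quick illustration on a made-up sample shows what each substitution does, and also the trade-off: accented words get mangled ("café" becomes "caf"), which is tolerable for a simple ASCII-only vocabulary but lossy in general.

```python
import re

# Illustrative sample (not from the book): accents, a symbol, and digits
sample = 'café №42 résumé'

no_ascii = re.sub(r'[^\x00-\x7F]+', ' ', sample)  # non-ASCII runs -> one space
no_digits = re.sub(r'\d+', '', no_ascii)          # digits removed outright
```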

3. Case Normalization

Converting everything to lowercase prevents the model from treating "Machine" and "machine" as two separate concepts.

text = text.lower()

โœ‚๏ธ Advanced Parsing

While splitting by whitespace is a start, we must also handle punctuation. We don't want "machine." (with a period) to be a different token than "machine".

import string

text = "  Hello, world!  "

# Match runs of punctuation or whitespace; re.escape keeps characters
# like ] and \ in string.punctuation from breaking the character class
pattern = fr'[{re.escape(string.punctuation)}\s]+'

# Split on the pattern and keep words longer than one character
words = [w for w in re.split(pattern, text.lower()) if len(w) > 1]
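Applied to a made-up sentence, the same split shows why this matters: trailing punctuation no longer creates duplicate tokens.

```python
import re
import string

# re.escape keeps regex metacharacters in string.punctuation
# (such as ] and \) from breaking the character class
pattern = fr'[{re.escape(string.punctuation)}\s]+'

sample = "The machine. A machine, he said!"
tokens = [w for w in re.split(pattern, sample.lower()) if w]
# "machine." and "machine," both reduce to the same token "machine"
```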

๐Ÿ—๏ธ Robust Encoding Functions

As our workflow becomes more complex, it's helpful to wrap our encoding and decoding logic into dedicated functions.

import numpy as np

def encoder(word_list, word2idx):
    # Map each word to its vocabulary index
    # (assumes every word is present in word2idx)
    indices = np.zeros(len(word_list), dtype=int)
    for i, word in enumerate(word_list):
        indices[i] = word2idx[word]
    return indices

def decoder(indices, idx2word):
    # Reconstruct the string, silently skipping any unknown indices
    return ' '.join([idx2word[i] for i in indices if i in idx2word])

Testing the Pipeline

# Create maps
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
 
# Encode-then-Decode check
original_phrase = ['the', 'time', 'machine']
encoded = encoder(original_phrase, word2idx)
decoded = decoder(encoded, idx2word)
 
print(encoded)  # e.g. [4042 4109 2416] when the vocab is built from the full book
print(decoded)  # the time machine

By cleaning the text first, we've reduced the complexity the model has to learn, allowing it to focus on the semantic relationships between words.
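As a wrap-up, the whole pipeline can be sketched as two small helpers (the function names and the sample sentence are illustrative, not from the original post):

```python
import re
import string

def preprocess(raw_text):
    """Run the cleaning steps above and return a list of word tokens."""
    # 1. Normalize special characters to spaces
    for s in ['\r\n', '—', '_', '“', '”']:
        raw_text = raw_text.replace(s, ' ')
    # 2. Strip non-ASCII runs, then digits
    text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
    text = re.sub(r'\d+', '', text)
    # 3. Lowercase, then split on punctuation and whitespace
    pattern = fr'[{re.escape(string.punctuation)}\s]+'
    return [w for w in re.split(pattern, text.lower()) if len(w) > 1]

def build_vocab(words):
    """Map each unique word to an integer index and back."""
    vocab = sorted(set(words))
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for i, w in enumerate(vocab)}
    return word2idx, idx2word

words = preprocess('“The Time Machine,” by H. G. Wells — 1895')
word2idx, idx2word = build_vocab(words)
# words == ['the', 'time', 'machine', 'by', 'wells']
```

Note that single-letter fragments like the "H." and "G." initials and the year are dropped entirely, which is the same behavior as the filters above.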

© 2026 Driptanil Datta. All rights reserved.



Built with Love ❤️ | Last updated: Mar 16 2026