Text to Numbers 🔢
Before an AI model can "read," we must transform raw human language into a format that computers understand: numbers. This process starts with tokenization and concludes with creating an efficient encoding system.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
At its simplest, converting text to numbers involves three steps:
- Parsing: Breaking sentences into individual units (tokens).
- Indexing: Assigning a unique number to each unique token.
- Encoding: Translating our text into a sequence of these numbers.
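The three steps above can be sketched end-to-end in a few lines of standard-library Python (the sample sentence here is illustrative, not from the article's corpus):

```python
import re

sentence = 'To be or not to be'

# 1. Parsing: split the sentence into tokens
tokens = re.split(r'\s', sentence.lower())

# 2. Indexing: assign each unique token a number
vocab = sorted(set(tokens))
word2idx = {word: i for i, word in enumerate(vocab)}

# 3. Encoding: translate the token sequence into numbers
encoded = [word2idx[token] for token in tokens]
print(encoded)  # [3, 0, 2, 1, 3, 0]
```

Note that repeated words ("to", "be") map to the same integer every time they occur, which is exactly what the vocabulary is for.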
1. Parsing Text into Words
The first challenge is deciding how to split the text. A common approach is splitting by whitespace using regular expressions.
import re
text = [
'All that we are is the result of what we have thought',
'To be or not to be that is the question',
'Be yourself everyone else is already taken'
]
# Separate first sentence into words
words = re.split(r'\s', text[0])
# Result: ['All', 'that', 'we', 'are', 'is', 'the', 'result', 'of', 'what', 'we', 'have', 'thought']

Pro Tip: In real-world applications, we also normalize the text by converting it to lower-case so that "The" and "the" are treated as the same word.
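A minimal illustration of that Pro Tip (the sample string is made up for this example):

```python
import re

raw = 'The theory explains the data'

# Without normalization, "The" and "the" are distinct tokens
tokens = re.split(r'\s', raw)
print(len(set(tokens)))       # 5 unique tokens

# After lower-casing, they collapse into one vocabulary entry
normalized = re.split(r'\s', raw.lower())
print(len(set(normalized)))   # 4 unique tokens
```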
2. Creating a Vocabulary (Lexicon)
Once we have a list of all words in our corpus, we need to find the unique ones to build our Vocabulary.
# Combine all text and lowercase it
all_words = re.split(r'\s', ' '.join(text).lower())
# Create a sorted list of unique words
vocab = sorted(set(all_words))

For our sample text:
- Total words: 29
- Vocabulary size: 21
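These counts can be verified directly with len(), re-using the same corpus and code as above:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

all_words = re.split(r'\s', ' '.join(text).lower())
vocab = sorted(set(all_words))

print(len(all_words))  # 29 total words
print(len(vocab))      # 21 unique words
```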
3. Building the Encoder and Decoder
To move between words and numbers efficiently, we create two mapping dictionaries:
- word2idx: Maps a word (string) to an index (integer).
- idx2word: Maps an index (integer) back to a word (string).
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}
print(f'The word "to" has index {word2idx["to"]}') # 17
print(f'The index "7" maps to the word "{idx2word[7]}"') # "is"

4. Final Tokenization
With our maps ready, we can now "translate" any text from our vocabulary into a sequence of integers that a neural network can process.
# Translate the text into numbers
text_as_int = [ word2idx[word] for word in all_words ]
# Results in a list of integers: [0, 14, 18, 2, 7, 15, ...]

From Numbers back to Text
for token_id in text_as_int[:5]:
    print(f'Token {token_id}: {idx2word[token_id]}')
# Output:
# Token 0: all
# Token 14: that
# Token 18: we
# Token 2: are
# Token 7: is

⚠️ Challenges & Limitations
While word-level tokenization is intuitive, it has significant drawbacks:
- Vocabulary Size: Can become massive (millions of words).
- Out-of-Vocabulary (OOV): New words or spelling errors can't be handled.
- Morphology: "Run", "Running", and "Ran" are treated as unrelated tokens.
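The OOV problem is easy to demonstrate: a word-level encoder has no entry for any word it did not see during vocabulary building. The sketch below uses a toy vocabulary and a -1 placeholder for unknowns, which is one common workaround (not part of the code above):

```python
vocab = ['all', 'are', 'is', 'that', 'the', 'we']
word2idx = {word: i for i, word in enumerate(vocab)}

def encode(words):
    # Plain word2idx[word] would raise a KeyError on unseen words,
    # so map anything outside the vocabulary to a placeholder id
    ids = []
    for word in words:
        if word in word2idx:
            ids.append(word2idx[word])
        else:
            ids.append(-1)  # unknown (OOV) token
    return ids

print(encode(['we', 'are', 'tokenizing']))  # [5, 1, -1]
```

"tokenizing" never appeared in the vocabulary, so the encoder can only flag it, not represent it; subword tokenization avoids this by building every word from smaller known pieces.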
In the next lesson, we'll explore Subword Tokenization (BPE), which solves these problems by splitting words into smaller, meaningful chunks.