Text to Numbers 🔢
Before an AI model can "read," we must transform raw human language into a format that computers understand: numbers. This process starts with tokenization and concludes with creating an efficient encoding system.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
At its simplest, converting text to numbers involves three steps:
- Parsing: Breaking sentences into individual units (tokens).
- Indexing: Assigning a unique number to each unique token.
- Encoding: Translating our text into a sequence of these numbers.
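The three steps above can be sketched end-to-end in a few lines of standard-library Python (the sample sentence here is illustrative, not from the article's corpus):

```python
import re

sentence = 'To be or not to be'

# 1. Parsing: split the sentence into tokens
tokens = re.split(r'\s', sentence.lower())

# 2. Indexing: assign each unique token a number
vocab = sorted(set(tokens))
word2idx = {word: i for i, word in enumerate(vocab)}

# 3. Encoding: translate the token sequence into numbers
encoded = [word2idx[token] for token in tokens]
print(encoded)  # [3, 0, 2, 1, 3, 0]
```

Note that repeated words ("to", "be") map to the same integer every time they occur, which is exactly what the vocabulary is for.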
1. Parsing Text into Words
The first challenge is deciding how to split the text. A common approach is splitting by whitespace using regular expressions.
import re
text = [
'All that we are is the result of what we have thought',
'To be or not to be that is the question',
'Be yourself everyone else is already taken'
]
# Separate first sentence into words
words = re.split(r'\s', text[0])
# Result: ['All', 'that', 'we', 'are', 'is', 'the', 'result', 'of', 'what', 'we', 'have', 'thought']

Pro Tip: In real-world applications, we also normalize the text by converting it to lower-case so that "The" and "the" are treated as the same word.
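A minimal illustration of that Pro Tip (the sample string is made up for this example):

```python
import re

raw = 'The theory explains the data'

# Without normalization, "The" and "the" are distinct tokens
tokens = re.split(r'\s', raw)
print(len(set(tokens)))       # 5 unique tokens

# After lower-casing, they collapse into one vocabulary entry
normalized = re.split(r'\s', raw.lower())
print(len(set(normalized)))   # 4 unique tokens
```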
2. Creating a Vocabulary (Lexicon)
Once we have a list of all words in our corpus, we need to find the unique ones to build our Vocabulary.
# Combine all text and lowercase it
all_words = re.split(r'\s', ' '.join(text).lower())
# Create a sorted list of unique words
vocab = sorted(set(all_words))

For our sample text:
- Total words: 29
- Vocabulary size: 21
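These counts can be verified directly with len(), re-using the same corpus and code as above:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

all_words = re.split(r'\s', ' '.join(text).lower())
vocab = sorted(set(all_words))

print(len(all_words))  # 29 total words
print(len(vocab))      # 21 unique words
```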
3. Building the Encoder and Decoder
To move between words and numbers efficiently, we create two mapping dictionaries:
- word2idx: Maps a word (string) to an index (integer).
- idx2word: Maps an index (integer) back to a word (string).
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}
print(f'The word "to" has index {word2idx["to"]}') # 17
print(f'The index "7" maps to the word "{idx2word[7]}"') # "is"

4. Final Tokenization
With our maps ready, we can now "translate" any text from our vocabulary into a sequence of integers that a neural network can process.
# Translate the text into numbers
text_as_int = [ word2idx[word] for word in all_words ]
# Results in a list of integers: [0, 14, 18, 2, 7, 15, ...]

From Numbers back to Text
for token_id in text_as_int[:5]:
    print(f'Token {token_id}: {idx2word[token_id]}')
# Output:
# Token 0: all
# Token 14: that
# Token 18: we
# Token 2: are
# Token 7: is

⚠️ Challenges & Limitations
While word-level tokenization is intuitive, it has significant drawbacks:
- Vocabulary Size: Can become massive (millions of words).
- Out-of-Vocabulary (OOV): New words or spelling errors can't be handled.
- Morphology: "Run", "Running", and "Ran" are treated as unrelated tokens.
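The OOV problem is easy to demonstrate: a word-level encoder has no entry for any word it did not see during vocabulary building. The sketch below uses a toy vocabulary and a -1 placeholder for unknowns, which is one common workaround (not part of the code above):

```python
vocab = ['all', 'are', 'is', 'that', 'the', 'we']
word2idx = {word: i for i, word in enumerate(vocab)}

def encode(words):
    # Plain word2idx[word] would raise a KeyError on unseen words,
    # so map anything outside the vocabulary to a placeholder id
    ids = []
    for word in words:
        if word in word2idx:
            ids.append(word2idx[word])
        else:
            ids.append(-1)  # unknown (OOV) token
    return ids

print(encode(['we', 'are', 'tokenizing']))  # [5, 1, -1]
```

"tokenizing" never appeared in the vocabulary, so the encoder can only flag it, not represent it; subword tokenization avoids this by building every word from smaller known pieces.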
In the next lesson, we'll explore Subword Tokenization (BPE), which solves these problems by splitting words into smaller, meaningful chunks.