Text to Numbers 🔢

Before an AI model can "read," we must transform raw human language into a format that computers understand: numbers. This process starts with tokenization and concludes with creating an efficient encoding system.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

At its simplest, converting text to numbers involves three steps:

  1. Parsing: Breaking sentences into individual units (tokens).
  2. Indexing: Assigning a unique number to each unique token.
  3. Encoding: Translating our text into a sequence of these numbers.

1. Parsing Text into Words

The first challenge is deciding how to split the text. A common approach is splitting by whitespace using regular expressions.

import re
 
text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]
 
# Separate first sentence into words
words = re.split(r'\s+', text[0])  # raw string avoids an invalid-escape warning; '+' merges runs of whitespace
# Result: ['All', 'that', 'we', 'are', 'is', 'the', 'result', 'of', 'what', 'we', 'have', 'thought']

Pro Tip: In real-world applications, we also normalize the text by converting it to lower-case to ensure that "The" and "the" are treated as the same word.
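To make that tip concrete, here is a minimal normalization sketch. The sample sentence and the punctuation-stripping regex are illustrative choices, not the only way to do it:

```python
import re

raw = 'The THEATRE was full; the show was great!'

# Lower-case and strip punctuation before splitting,
# so 'The' and 'the' collapse to the same token.
normalized = re.sub(r'[^\w\s]', '', raw.lower())
tokens = re.split(r'\s+', normalized)

print(tokens)
# ['the', 'theatre', 'was', 'full', 'the', 'show', 'was', 'great']
```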


2. Creating a Vocabulary (Lexicon)

Once we have a list of all words in our corpus, we need to find the unique ones to build our Vocabulary.

# Combine all text and lowercase it
all_words = re.split(r'\s+', ' '.join(text).lower())
 
# Create a sorted list of unique words
vocab = sorted(set(all_words))

For our sample text:

  • Total words: 29
  • Vocabulary size: 21
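These counts can be checked directly; this snippet simply re-runs the join, split, and set steps on the three sample sentences:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

# Lower-case everything, split on whitespace, then deduplicate.
all_words = re.split(r'\s+', ' '.join(text).lower())
vocab = sorted(set(all_words))

print(len(all_words))  # 29 total words
print(len(vocab))      # 21 unique words
```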

3. Building the Encoder and Decoder

To move between words and numbers efficiently, we create two mapping dictionaries:

  • word2idx: Maps a word (string) to an index (integer).
  • idx2word: Maps an index (integer) back to a word (string).

word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}
 
print(f'The word "to" has index {word2idx["to"]}') # 17
print(f'The index "7" maps to the word "{idx2word[7]}"') # "is"

4. Final Tokenization

With our maps ready, we can now "translate" any text from our vocabulary into a sequence of integers that a neural network can process.

# Translate the text into numbers
text_as_int = [word2idx[word] for word in all_words]
 
# Results in a list of integers: [0, 14, 18, 2, 7, 15, ...]

From Numbers back to Text

for token_id in text_as_int[:5]:
    print(f'Token {token_id}: {idx2word[token_id]}')
 
# Output:
# Token  0: all
# Token 14: that
# Token 18: we
# Token  2: are
# Token  7: is
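The per-token loop above generalizes naturally to a pair of small helper functions. `encode` and `decode` are illustrative names, and this sketch assumes every word in the input already appears in the vocabulary:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

all_words = re.split(r'\s+', ' '.join(text).lower())
vocab = sorted(set(all_words))
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}

def encode(sentence):
    """Turn a whitespace-separated sentence into a list of token ids."""
    return [word2idx[w] for w in sentence.lower().split()]

def decode(token_ids):
    """Turn a list of token ids back into a sentence."""
    return ' '.join(idx2word[i] for i in token_ids)

ids = encode('to be or not to be')
print(ids)          # [17, 3, 10, 8, 17, 3]
print(decode(ids))  # 'to be or not to be'
```

Because the two dictionaries are exact inverses, `decode(encode(s))` reconstructs any in-vocabulary sentence losslessly.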

⚠️ Challenges & Limitations

While word-level tokenization is intuitive, it has significant drawbacks:

  • Vocabulary Size: Can become massive (millions of words).
  • Out-of-Vocabulary (OOV): New words or spelling errors can't be handled.
  • Morphology: "Run", "Running", and "Ran" are treated as unrelated tokens.
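The OOV problem is easy to trigger: looking up a word that was never in the corpus raises a `KeyError`. One common workaround (a patch, not a real fix) is reserving a dedicated `<unk>` token; note that prepending it shifts every other index up by one. A minimal sketch:

```python
import re

text = [
    'All that we are is the result of what we have thought',
    'To be or not to be that is the question',
    'Be yourself everyone else is already taken'
]

all_words = re.split(r'\s+', ' '.join(text).lower())

# Reserve index 0 for an <unk> placeholder; every real word shifts up by one.
vocab = ['<unk>'] + sorted(set(all_words))
word2idx = {word: i for i, word in enumerate(vocab)}

UNK = word2idx['<unk>']

# 'bee' never appeared in the corpus, so .get() falls back to <unk>.
ids = [word2idx.get(w, UNK) for w in 'to be or not to bee'.split()]
print(ids)  # [18, 4, 11, 9, 18, 0]
```

Every unseen word collapses onto the same `<unk>` id, so information is still lost; subword tokenization avoids this entirely.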

In the next lesson, we'll explore Subword Tokenization (BPE), which solves these problems by splitting words into smaller, meaningful chunks.

© 2026 Driptanil Datta. All rights reserved.


Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


Built with Love ❤️ | Last updated: Mar 16 2026