Learn about BERT's unique approach to tokenization. Discover the WordPiece algorithm and why BERT needs special tokens like [CLS] and [SEP].
BERT Tokenization: WordPiece & Special Tokens 🦷
While GPT uses Byte Pair Encoding (BPE), BERT uses a similar but distinct algorithm called WordPiece. BERT is also an "encoder-only" model designed for understanding context, which requires a few special additions to the token sequence.
This content is adapted from *A deep understanding of AI language model mechanisms*. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. WordPiece vs. BPE
WordPiece works similarly to BPE by merging frequent pairs, but it uses a likelihood-based approach rather than simple frequency. One signature of BERT's WordPiece is the ## prefix, which denotes a subword that is part of a larger word.
```python
from transformers import BertTokenizer

# Load the base uncased BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Check vocabulary size
print(f"BERT Vocab Size: {tokenizer.vocab_size}")
```

```
BERT Vocab Size: 30522
```

Notice that BERT's vocabulary (~30k tokens) is significantly smaller than GPT-4's (~100k).
2. Special Tokens: [CLS] and [SEP]
BERT adds specific markers to the beginning and end of every sequence:
- [CLS] (ID 101): The "Classification" token. It's meant to represent a summary of the entire sentence.
- [SEP] (ID 102): The "Separation" token. It marks the end of a sentence or the boundary between two sentences.
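These IDs are exposed directly on the tokenizer object via its standard special-token attributes:

```python
# The tokenizer knows its own special tokens and their IDs.
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.all_special_tokens)                 # also includes [PAD], [UNK], [MASK]
```

Encoding a short sentence shows where these markers land: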
text = "science is great"
encoded = tokenizer.encode(text)
for i in encoded:
print(f"ID {i} -> '{tokenizer.decode([i])}'")ID 101 -> '[CLS]'
ID 2671 -> 'science'
ID 2003 -> 'is'
ID 2307 -> 'great'
ID 102 -> '[SEP]'3. Subword Segmentation
Let's see how BERT handles words that aren't in its primary vocabulary.
sentence = "AI is both exciting and terrifying."
tokens = tokenizer.tokenize(sentence)
print(f"Tokens: {tokens}")Tokens: ['ai', 'is', 'both', 'exciting', 'and', 'terrifying', '.']If we used a more complex word like "unbelievably":
tokenizer.tokenize("unbelievably")
# ['un', '##believ', '##ably']The ## tells the model that these chunks are connected to the preceding token.
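Because the ## prefix marks continuation pieces, the tokenizer can stitch subwords back into the original word. A brief sketch using the standard `convert_tokens_to_string` helper:

```python
# Continuation markers let the tokenizer reassemble the original word.
pieces = tokenizer.tokenize("unbelievably")
print(pieces)                                      # ['un', '##believ', '##ably']
print(tokenizer.convert_tokens_to_string(pieces))  # 'unbelievably'
```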
4. Why the Difference?
| Feature | GPT (BPE) | BERT (WordPiece) |
|---|---|---|
| Logic | Frequency-based merges | Likelihood-based merges |
| Subwords | Leading space (' word') | Hash prefix ('##word') |
| Special Tokens | `<\|endoftext\|>` | `[CLS]`, `[SEP]` |
| Case | Usually case-sensitive | Often case-insensitive (uncased) |
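The subword-marking difference in the table is easy to see by running both tokenizers on the same word. A minimal sketch, assuming `transformers` is installed and using the `gpt2` and `bert-base-uncased` checkpoints as stand-ins for the two families:

```python
from transformers import AutoTokenizer

# GPT-2 uses BPE; BERT uses WordPiece.
gpt2_tok = AutoTokenizer.from_pretrained('gpt2')
bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')

word = " unbelievably"  # the leading space matters for BPE
print("GPT-2 (BPE):      ", gpt2_tok.tokenize(word))
print("BERT (WordPiece): ", bert_tok.tokenize(word))
```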
💡 Summary
BERT's tokenization is tailored for natural language understanding. The inclusion of [CLS] and [SEP] allows the model to process relationships between sentences and build a global representation of the input.
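As an illustration of how the [CLS] position is typically used downstream, here is a rough sketch that pulls the [CLS] hidden state out of a plain `BertModel`; the model name and the choice to read position 0 directly are assumptions made for demonstration, not part of the tokenization lesson itself:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("science is great", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 is the [CLS] token; its hidden state is commonly used as a
# sentence-level representation for classification heads.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```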
In the next lesson, we'll dive deeper into BERT's character-level handling and the [MASK] token!