Learn about BERT's unique approach to tokenization. Discover the WordPiece algorithm and why BERT needs special tokens like [CLS] and [SEP].
BERT Tokenization: WordPiece & Special Tokens 🦷
While GPT uses Byte Pair Encoding (BPE), BERT uses a similar but distinct algorithm called WordPiece. BERT is also an "encoder-only" model designed for understanding context, which requires a few special additions to the token sequence.
This content is adapted from *A deep understanding of AI language model mechanisms*. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. WordPiece vs. BPE
WordPiece works similarly to BPE by merging frequent pairs, but it uses a likelihood-based approach rather than simple frequency. One signature of BERT's WordPiece is the ## prefix, which denotes a subword that is part of a larger word.
```python
from transformers import BertTokenizer

# Load the base uncased BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Check vocabulary size
print(f"BERT Vocab Size: {tokenizer.vocab_size}")
```

```
BERT Vocab Size: 30522
```

Notice that BERT's vocabulary (~30k tokens) is significantly smaller than GPT-4's (~100k).
2. Special Tokens: [CLS] and [SEP]
BERT adds specific markers to the beginning and end of every sequence:
- [CLS] (ID 101): The "Classification" token. It's meant to represent a summary of the entire sentence.
- [SEP] (ID 102): The "Separation" token. It marks the end of a sentence or the boundary between two sentences.
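These IDs are exposed directly on the tokenizer object via its standard special-token attributes:

```python
# The tokenizer knows its own special tokens and their IDs.
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.all_special_tokens)                 # also includes [PAD], [UNK], [MASK]
```

Encoding a short sentence shows where these markers land: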
text = "science is great"
encoded = tokenizer.encode(text)
for i in encoded:
print(f"ID {i} -> '{tokenizer.decode([i])}'")ID 101 -> '[CLS]'
ID 2671 -> 'science'
ID 2003 -> 'is'
ID 2307 -> 'great'
ID 102 -> '[SEP]'3. Subword Segmentation
Let's see how BERT handles words that aren't in its primary vocabulary.
sentence = "AI is both exciting and terrifying."
tokens = tokenizer.tokenize(sentence)
print(f"Tokens: {tokens}")Tokens: ['ai', 'is', 'both', 'exciting', 'and', 'terrifying', '.']If we used a more complex word like "unbelievably":
tokenizer.tokenize("unbelievably")
# ['un', '##believ', '##ably']The ## tells the model that these chunks are connected to the preceding token.
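Because the ## prefix marks continuation pieces, the tokenizer can stitch subwords back into the original word. A brief sketch using the standard `convert_tokens_to_string` helper:

```python
# Continuation markers let the tokenizer reassemble the original word.
pieces = tokenizer.tokenize("unbelievably")
print(pieces)                                      # ['un', '##believ', '##ably']
print(tokenizer.convert_tokens_to_string(pieces))  # 'unbelievably'
```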
4. Why the Difference?
| Feature | GPT (BPE) | BERT (WordPiece) |
|---|---|---|
| Logic | Frequency-based merges | Likelihood-based merges |
| Subwords | Leading space (' word') | Hash prefix ('##word') |
| Special Tokens | `<\|endoftext\|>` | `[CLS]`, `[SEP]` |
| Case | Usually case-sensitive | Often case-insensitive (uncased) |
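The subword-marking difference in the table is easy to see by running both tokenizers on the same word. A minimal sketch, assuming `transformers` is installed and using the `gpt2` and `bert-base-uncased` checkpoints as stand-ins for the two families:

```python
from transformers import AutoTokenizer

# GPT-2 uses BPE; BERT uses WordPiece.
gpt2_tok = AutoTokenizer.from_pretrained('gpt2')
bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')

word = " unbelievably"  # the leading space matters for BPE
print("GPT-2 (BPE):      ", gpt2_tok.tokenize(word))
print("BERT (WordPiece): ", bert_tok.tokenize(word))
```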
💡 Summary
BERT's tokenization is tailored for natural language understanding. The inclusion of [CLS] and [SEP] allows the model to process relationships between sentences and build a global representation of the input.
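As an illustration of how the [CLS] position is typically used downstream, here is a rough sketch that pulls the [CLS] hidden state out of a plain `BertModel`; the model name and the choice to read position 0 directly are assumptions made for demonstration, not part of the tokenization lesson itself:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("science is great", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 is the [CLS] token; its hidden state is commonly used as a
# sentence-level representation for classification heads.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```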
In the next lesson, we'll dive deeper into BERT's character-level handling and the [MASK] token!