AI · Mar 2025 · 11 min read

Learn about BERT's unique approach to tokenization. Discover the WordPiece algorithm and why BERT needs special tokens like [CLS] and [SEP].

BERT Tokenization: WordPiece & Special Tokens 🦷

Driptanil Datta · Software Developer


While GPT uses Byte Pair Encoding (BPE), BERT uses a similar but distinct algorithm called WordPiece. BERT is also an "encoder-only" model designed for understanding context, which requires a few special additions to the token sequence.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


1. WordPiece vs. BPE

WordPiece builds its vocabulary much like BPE, by iteratively merging subword pairs, but it chooses the merge that most increases the likelihood of the training corpus rather than simply the most frequent pair. One signature of BERT's WordPiece is the ## prefix, which marks a subword that continues the preceding piece within a word.

from transformers import BertTokenizer
 
# Load the base uncased BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 
# Check vocabulary size
print(f"BERT Vocab Size: {tokenizer.vocab_size}")
OUTPUT
BERT Vocab Size: 30522

Notice that BERT's vocabulary (30k) is significantly smaller than GPT-4's (100k).
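If you're curious how much of that 30k vocabulary is made up of ## continuation pieces, you can inspect the mapping directly. A quick sketch, reusing the tokenizer loaded above and the standard get_vocab() method (the printed counts depend on the checkpoint):

# Inspect the WordPiece vocabulary for continuation pieces
vocab = tokenizer.get_vocab()  # dict mapping token string -> ID
continuations = [tok for tok in vocab if tok.startswith("##")]
 
print(f"Total tokens: {len(vocab)}")
print(f"Continuation tokens (##): {len(continuations)}")
print(f"Examples: {continuations[:5]}")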


2. Special Tokens: [CLS] and [SEP]

BERT adds specific markers to the beginning and end of every sequence:

  • [CLS] (ID 101): The "Classification" token. Its final hidden state is used as a summary representation of the entire sequence, e.g. for classification tasks.
  • [SEP] (ID 102): The "Separation" token. It marks the end of a sequence or the boundary between two sentences (see the sentence-pair example after the snippet below).

text = "science is great"
encoded = tokenizer.encode(text)
 
for i in encoded:
  print(f"ID {i} -> '{tokenizer.decode([i])}'")
OUTPUT
ID 101 -> '[CLS]'
ID 2671 -> 'science'
ID 2003 -> 'is'
ID 2307 -> 'great'
ID 102 -> '[SEP]'
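
[SEP] really shines when you pass two sentences at once: the tokenizer inserts it between the segments and also returns token_type_ids marking which segment each token belongs to. A minimal sketch, reusing the same tokenizer with two placeholder sentences:

# Encode a sentence pair: [CLS] A [SEP] B [SEP]
encoded = tokenizer("science is great", "it moves fast")
 
print(encoded["input_ids"])       # IDs for [CLS] + sentence A + [SEP] + sentence B + [SEP]
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B
print(tokenizer.decode(encoded["input_ids"]))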

3. Subword Segmentation

Let's see how BERT tokenizes a full sentence, and what it does with words that aren't in its primary vocabulary.

sentence = "AI is both exciting and terrifying."
tokens = tokenizer.tokenize(sentence)
 
print(f"Tokens: {tokens}")
OUTPUT
Tokens: ['ai', 'is', 'both', 'exciting', 'and', 'terrifying', '.']

Every token came out lowercased because this is the uncased model. If we try a longer word like "unbelievably", which isn't in the vocabulary as a whole, it gets split into subwords:

tokenizer.tokenize("unbelievably")
# ['un', '##believ', '##ably']

The ## tells the model that these chunks are connected to the preceding token.
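
Because ## is just a bookkeeping convention, the tokenizer can stitch the pieces back into the (lowercased) word. A quick sketch using the standard convert_tokens_to_string method:

pieces = tokenizer.tokenize("unbelievably")
 
# Rejoin the subword pieces, stripping the ## markers
print(tokenizer.convert_tokens_to_string(pieces))  # 'unbelievably'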


4. Why the Difference?

Feature        | GPT (BPE)                | BERT (WordPiece)
Logic          | Frequency-based merges   | Likelihood-based merges
Subwords       | Leading space (' word')  | Hash prefix ('##word')
Special tokens | <|endoftext|>            | [CLS], [SEP], [MASK]
Case           | Usually case-sensitive   | Often case-insensitive (uncased)
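
The case-handling row is easy to verify yourself: the uncased checkpoint lowercases text before tokenizing, while bert-base-cased preserves capitalization. A small sketch (the exact splits depend on each checkpoint's vocabulary):

from transformers import BertTokenizer
 
uncased = BertTokenizer.from_pretrained('bert-base-uncased')
cased = BertTokenizer.from_pretrained('bert-base-cased')
 
# The uncased model lowercases first; the cased model keeps capitals
print(uncased.tokenize("Science is great"))  # e.g. ['science', 'is', 'great']
print(cased.tokenize("Science is great"))    # e.g. ['Science', 'is', 'great']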

💡 Summary

BERT's tokenization is tailored for natural language understanding. The inclusion of [CLS] and [SEP] allows the model to process relationships between sentences and build a global representation of the input.

In the next lesson, we'll dive deeper into BERT's character-level handling and the [MASK] token!
