
Wikipedia vs. Twitter Embeddings
How does the training corpus affect the semantic structure of word embeddings? We compare GloVe models trained on formal Wikipedia text versus informal Twitter data.
Word embeddings are not objective representations of human language; they are mirrors of the data they were trained on. In this lesson, we compare two GloVe models to see how domain specificity and corpus bias shift the semantic relationships between words.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
Training a model on Wikipedia (formal, encyclopedic) versus Twitter (informal, conversational) leads to distinct "semantic drift":
- Corpus Bias: A word like "battery" might associate with "artillery" in historical Wikipedia texts, but "charger" or "phone" in everyday Twitter discourse.
- Domain Specificity: The vocabulary size and the "meaning" of slang or modern tech terms vary wildly between these two symbolic universes.
- Stability vs. Flux: Formal corpora provide stable, standard definitions, while social media corpora capture evolving, context-heavy usage.
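The comparisons that follow all rest on cosine similarity between word vectors. Here is a minimal sketch of the idea using made-up 3-dimensional "embeddings" (the words and values are hypothetical, not taken from either GloVe model): the same query word can have a different nearest neighbor depending on which space it lives in.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-D embeddings, as a formal corpus might arrange them...
formal = {
    'battery':   np.array([ .9,  .1,  .4]),
    'artillery': np.array([ .8,  .2,  .5]),
    'charger':   np.array([ .1,  .9, -.2]),
}
# ...and as an informal corpus might arrange them.
informal = {
    'battery':   np.array([ .2,  .8,  .1]),
    'artillery': np.array([ .9, -.1,  .3]),
    'charger':   np.array([ .3,  .9,  .2]),
}

for name, space in [('formal', formal), ('informal', informal)]:
    sims = {w: cosine_similarity(space['battery'], v)
            for w, v in space.items() if w != 'battery'}
    nearest = max(sims, key=sims.get)
    print(f'{name}: nearest neighbor of "battery" is "{nearest}"')
```

In the "formal" toy space the nearest neighbor of "battery" is "artillery"; in the "informal" one it is "charger". That is the pattern we will look for in the real models below.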
1. Environment Setup
We'll use numpy and matplotlib to compare the two models side by side.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# svg figure format
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

2. Downloading the Models
We load two different pre-trained GloVe models: one trained on Wikipedia plus Gigaword newswire text, and one trained on 2 billion tweets. Both use 50-dimensional vectors, which keeps the comparison fair.
# NOTE: If you get errors importing, run the following !pip... line,
# then restart your session (from Runtime menu) and comment out the pip line.
# !pip install gensim
import gensim.downloader as api
# download the wikipedia and twitter models
wiki_model = api.load('glove-wiki-gigaword-50')
twitter_model = api.load('glove-twitter-50')

3. Comparing Vocabulary Size
Twitter's vocabulary is significantly larger (nearly 1.2 million entries, versus Wikipedia's 400,000), likely due to hashtags, handles, and informal spelling variants.
# embedding matrix dimensions
print(f'Wikipedia model has {len(wiki_model.index_to_key)} words and {wiki_model.vector_size} embedding dimensions.')
print(f'Twitter model has {len(twitter_model.index_to_key)} words and {twitter_model.vector_size} embedding dimensions.')

Wikipedia model has 400000 words and 50 embedding dimensions.
Twitter model has 1193514 words and 50 embedding dimensions.

4. Visualizing Vector Profiles
Even for a simple word like "table", the 50-dimensional vectors from the two models look completely different. This is because the "dimensions" in one latent space do not correspond to the dimensions in the other.
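A quick way to see why per-dimension comparisons across models are meaningless: rotating an embedding space changes every coordinate of every vector, yet leaves all pairwise cosine similarities intact. So the axes of each latent space are essentially arbitrary. A minimal sketch with random toy vectors (not the real GloVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 50))  # four toy 50-D "word vectors"

# build a random orthogonal (rotation) matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
rotated = vecs @ Q

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# pairwise similarities are preserved even though every coordinate changed
for i in range(4):
    for j in range(i + 1, 4):
        assert np.isclose(cos(vecs[i], vecs[j]), cos(rotated[i], rotated[j]))
print('All pairwise cosine similarities survive the rotation.')
```

This is why the plot below shows two vector profiles that look nothing alike, even though both models place "table" near similar neighbors.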
targetword = 'table'
_,axs = plt.subplots(1,figsize=(6,4))
axs.plot(wiki_model[targetword],'ks',markerfacecolor=[.7,.7,.9],markersize=8,label='Wikipedia')
axs.plot(twitter_model[targetword],'ko',markerfacecolor=[.7,.9,.7],markersize=8,label='Twitter')
axs.set(xlabel='Embedding dimension',ylabel='Embedding value',title=f'Embeddings for "{targetword}"')
axs.legend()
plt.tight_layout()
plt.show()

5. Similarity within Models
We can check how similar "table" and "chair" are within each model. While both models recognize them as related, the exact cosine similarity score varies based on the frequency of their co-occurrence in the training data.
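Gensim's `similarity` method returns exactly this cosine similarity. A minimal numpy reimplementation on toy vectors (the values below are made up, not the real GloVe embeddings):

```python
import numpy as np

a = np.array([0.3, -1.2, 0.8, 0.5])
b = np.array([0.1, -0.9, 1.1, 0.2])

# cosine similarity: dot product divided by the product of the norms
cs = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# equivalently: the dot product after unit-normalizing each vector
cs_alt = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(np.round(cs, 3))
```

Because the vectors are normalized first, only the angle between them matters, not their lengths.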
# word pair
word1 = 'table'
word2 = 'chair'
# scatter plot for wiki
_,axs = plt.subplots(1,2,figsize=(12,4.5))
axs[0].plot(wiki_model[word1], wiki_model[word2],'ks',markersize=9,markerfacecolor=[.9,.7,.7])
axs[0].set(xlabel=f'Embedding for "{word1}"',ylabel=f'Embedding for "{word2}"',
title=f'WIKI (Cosine similarity: { np.round(wiki_model.similarity(word1,word2),3)})')
# scatter plot for twitter
axs[1].plot(twitter_model[word1], twitter_model[word2],'ks',markersize=9,markerfacecolor=[.9,.7,.7])
axs[1].set(xlabel=f'Embedding for "{word1}"',ylabel=f'Embedding for "{word2}"',
title=f'TWITTER (Cosine similarity: { np.round(twitter_model.similarity(word1,word2),3)})')
plt.tight_layout()
plt.show()

6. Semantic Drift: The "Battery" Test
This is where the corpus bias becomes obvious. In Wikipedia, "battery" is highly associated with technical terms like "lithium-ion" but also military terms like "weapon" and "gun" (as in an artillery battery). In Twitter, it's almost exclusively associated with consumer electronics.
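Under the hood, `most_similar` unit-normalizes all the vectors and ranks the vocabulary by cosine similarity to the query. A sketch of that ranking logic on a tiny made-up vocabulary (words and values are hypothetical, chosen only to illustrate the mechanics):

```python
import numpy as np

vocab = ['battery', 'charger', 'phone', 'cannon']
E = np.array([[ .8, .1, .5],    # battery  (toy values)
              [ .7, .2, .6],    # charger
              [ .6, .3, .4],    # phone
              [-.2, .9, .1]])   # cannon

def most_similar(word, topn=3):
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    q = En[vocab.index(word)]
    sims = En @ q                                      # cosine similarity to the query
    order = np.argsort(-sims)                          # descending similarity
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

for w, s in most_similar('battery'):
    print(f'  {w:10s} {s:.3f}')
```

With a real model the only differences are scale (hundreds of thousands of rows) and the precomputed normalized matrix that gensim caches.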
print('10 words most similar to "battery" in wiki:')
for word,sim in wiki_model.most_similar('battery',topn=10):
    print(f'  {word:15s}  {sim:.3f}')
print('\nAnd in twitter:')
for word,sim in twitter_model.most_similar('battery',topn=10):
    print(f'  {word:15s}  {sim:.3f}')

10 words most similar to "battery" in wiki:
  batteries        0.832
  rechargeable     0.726
  lithium-ion      0.708
  weapon           0.693

7. Analyzing a Sentence: Fox vs. Dog
Let's look at a classic sentence and compare each word's vocabulary index (roughly its frequency rank) in the two models.
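In gensim's GloVe models the vocabulary is ordered by corpus frequency, so `key_to_index[word]` is essentially the word's frequency rank. A toy illustration of how such a mapping is built from raw counts (the corpus here is just our example sentence, repeated a little):

```python
from collections import Counter

corpus = 'the quick brown fox jumps over the lazy dog the fox'.split()
counts = Counter(corpus)

# order words by descending frequency, as GloVe vocabularies are
index_to_key = [w for w, _ in counts.most_common()]
key_to_index = {w: i for i, w in enumerate(index_to_key)}

print(key_to_index['the'])   # the most frequent word gets index 0
```

A low index in the table below therefore means the word is common in that corpus; a large gap between the two columns means the word's relative frequency differs between Wikipedia and Twitter.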
text = 'The quick brown fox jumps over the lazy dog'
# lower-case to match GloVe's vocabulary, and keep only words present in both models
words = [w for w in text.lower().split()
         if w in wiki_model.key_to_index and w in twitter_model.key_to_index]

# index sequence in the two embeddings
wiki_idx = [wiki_model.key_to_index[w] for w in words]
twit_idx = [twitter_model.key_to_index[w] for w in words]

print(' Word | Wiki | Twitter')
print('-'*23)
for o,w,t in zip(words,wiki_idx,twit_idx):
    print(f'{o:>5} | {w:>5} | {t:>5}')

 Word | Wiki | Twitter
-----------------------
  The | 2582 | 2156
quick | 1042 | 1871
brown | 2106 | 4000

8. Final Comparison: Inter-word Similarities
Finally, we compare how word pairs relate to each other across the two models. The words are first filtered so that every pair exists in both vocabularies, avoiding a common indexing bug.
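The plot below colors each point by its distance from the diagonal y = x, computed in the loop by subtracting the point's projection onto the direction u = (1, 1). That residual is just |x − y|/√2, which a small sketch can confirm:

```python
import numpy as np

def dist_to_diagonal(x, y):
    """Perpendicular distance from the point (x, y) to the line y = x."""
    v = np.array([x, y])
    u = np.array([1, 1])
    proj = (v @ u) / (u @ u) * u      # projection of v onto the diagonal
    return np.linalg.norm(v - proj)

d = dist_to_diagonal(0.8, 0.2)
print(np.round(d, 4))                 # same as |0.8 - 0.2| / sqrt(2)
```

Points far from the diagonal (brighter colors) are word pairs whose similarity the two corpora disagree about most.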
# Filter words to ensure they exist in both models
valid_words = [w for w in words if w in wiki_model.key_to_index and w in twitter_model.key_to_index]
plt.figure(figsize=(9,7))
for i in range(len(valid_words)):
    for j in range(i+1, len(valid_words)):
        w1, w2 = valid_words[i], valid_words[j]
        if w1 == w2: continue  # skip repeated words (e.g., "the")
        cs_wiki = wiki_model.similarity(w1, w2)
        cs_twit = twitter_model.similarity(w1, w2)

        # distance from the point to the unity line (y = x)
        v = np.array([cs_wiki, cs_twit])
        u = np.array([1, 1])
        dist = np.linalg.norm(v - (sum(v*u)) / (np.linalg.norm(u)**2) * u)

        plt.plot(cs_wiki, cs_twit, 'ks', markersize=9, markerfacecolor=mpl.cm.plasma(dist*5))
        plt.text(cs_wiki, cs_twit+.02, f'{w1}-{w2}', va='bottom', ha='center', fontsize=8)
xylims = [.05, .95]
plt.plot(xylims, xylims, '--', color=[.4, .4, .4], zorder=-30)
plt.gca().set(xlim=xylims, ylim=xylims, xlabel='Wiki inter-word similarities',
ylabel='Twitter inter-word similarities', title='Inter-word similarities')
plt.show()