
Wikipedia vs. Twitter Embeddings
How does the training corpus affect the semantic structure of word embeddings? We compare GloVe models trained on formal Wikipedia text versus informal Twitter data.
Word embeddings are not objective representations of human language; they are mirrors of the data they were trained on. In this lesson, we compare two GloVe models to see how domain specificity and corpus bias shift the semantic relationships between words.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
Training a model on Wikipedia (formal, encyclopedic) versus Twitter (informal, conversational) leads to distinct "semantic drift":
- Corpus Bias: A word like "battery" might associate with "artillery" in historical Wikipedia texts, but "charger" or "phone" in everyday Twitter discourse.
- Domain Specificity: The vocabulary size and the "meaning" of slang or modern tech terms vary wildly between these two symbolic universes.
- Stability vs. Flux: Formal corpora provide stable, standard definitions, while social media corpora capture evolving, context-heavy usage.
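The comparisons that follow all rest on cosine similarity between word vectors. Here is a minimal sketch of the idea using made-up 3-dimensional "embeddings" (the words and values are hypothetical, not taken from either GloVe model): the same query word can have a different nearest neighbor depending on which space it lives in.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-D embeddings, as a formal corpus might arrange them...
formal = {
    'battery':   np.array([ .9,  .1,  .4]),
    'artillery': np.array([ .8,  .2,  .5]),
    'charger':   np.array([ .1,  .9, -.2]),
}
# ...and as an informal corpus might arrange them.
informal = {
    'battery':   np.array([ .2,  .8,  .1]),
    'artillery': np.array([ .9, -.1,  .3]),
    'charger':   np.array([ .3,  .9,  .2]),
}

for name, space in [('formal', formal), ('informal', informal)]:
    sims = {w: cosine_similarity(space['battery'], v)
            for w, v in space.items() if w != 'battery'}
    nearest = max(sims, key=sims.get)
    print(f'{name}: nearest neighbor of "battery" is "{nearest}"')
```

In the "formal" toy space the nearest neighbor of "battery" is "artillery"; in the "informal" one it is "charger". That is the pattern we will look for in the real models below.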
1. Environment Setup
We'll use numpy and matplotlib to compare the two models side by side.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# svg figure format
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

2. Downloading the Models
We load two different pre-trained GloVe models: one trained on Wikipedia plus Gigaword newswire text, and one trained on 2 billion tweets. Both use 50-dimensional vectors, which keeps the comparison fair.
# NOTE: If you get errors importing, run the following !pip... line,
# then restart your session (from Runtime menu) and comment out the pip line.
# !pip install gensim
import gensim.downloader as api
# download the wikipedia and twitter models
wiki_model = api.load('glove-wiki-gigaword-50')
twitter_model = api.load('glove-twitter-50')

3. Comparing Vocabulary Size
Twitter's vocabulary is significantly larger (nearly 1.2 million entries, versus Wikipedia's 400,000), likely due to hashtags, handles, and informal spelling variants.
# embedding matrix dimensions
print(f'Wikipedia model has {len(wiki_model.index_to_key)} words and {wiki_model.vector_size} embedding dimensions.')
print(f'Twitter model has {len(twitter_model.index_to_key)} words and {twitter_model.vector_size} embedding dimensions.')

Wikipedia model has 400000 words and 50 embedding dimensions.
Twitter model has 1193514 words and 50 embedding dimensions.

4. Visualizing Vector Profiles
Even for a simple word like "table", the 50-dimensional vectors from the two models look completely different. This is because the "dimensions" in one latent space do not correspond to the dimensions in the other.
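A quick way to see why per-dimension comparisons across models are meaningless: rotating an embedding space changes every coordinate of every vector, yet leaves all pairwise cosine similarities intact. So the axes of each latent space are essentially arbitrary. A minimal sketch with random toy vectors (not the real GloVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 50))  # four toy 50-D "word vectors"

# build a random orthogonal (rotation) matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
rotated = vecs @ Q

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# pairwise similarities are preserved even though every coordinate changed
for i in range(4):
    for j in range(i + 1, 4):
        assert np.isclose(cos(vecs[i], vecs[j]), cos(rotated[i], rotated[j]))
print('All pairwise cosine similarities survive the rotation.')
```

This is why the plot below shows two vector profiles that look nothing alike, even though both models place "table" near similar neighbors.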
targetword = 'table'
_,axs = plt.subplots(1,figsize=(6,4))
axs.plot(wiki_model[targetword],'ks',markerfacecolor=[.7,.7,.9],markersize=8,label='Wikipedia')
axs.plot(twitter_model[targetword],'ko',markerfacecolor=[.7,.9,.7],markersize=8,label='Twitter')
axs.set(xlabel='Embedding dimension',ylabel='Embedding value',title=f'Embeddings for "{targetword}"')
axs.legend()
plt.tight_layout()
plt.show()

5. Similarity within Models
We can check how similar "table" and "chair" are within each model. While both models recognize them as related, the exact cosine similarity score varies based on the frequency of their co-occurrence in the training data.
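Gensim's `similarity` method returns exactly this cosine similarity. A minimal numpy reimplementation on toy vectors (the values below are made up, not the real GloVe embeddings):

```python
import numpy as np

a = np.array([0.3, -1.2, 0.8, 0.5])
b = np.array([0.1, -0.9, 1.1, 0.2])

# cosine similarity: dot product divided by the product of the norms
cs = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# equivalently: the dot product after unit-normalizing each vector
cs_alt = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(np.round(cs, 3))
```

Because the vectors are normalized first, only the angle between them matters, not their lengths.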
# word pair
word1 = 'table'
word2 = 'chair'
# scatter plot for wiki
_,axs = plt.subplots(1,2,figsize=(12,4.5))
axs[0].plot(wiki_model[word1], wiki_model[word2],'ks',markersize=9,markerfacecolor=[.9,.7,.7])
axs[0].set(xlabel=f'Embedding for "{word1}"',ylabel=f'Embedding for "{word2}"',
title=f'WIKI (Cosine similarity: { np.round(wiki_model.similarity(word1,word2),3)})')
# scatter plot for twitter
axs[1].plot(twitter_model[word1], twitter_model[word2],'ks',markersize=9,markerfacecolor=[.9,.7,.7])
axs[1].set(xlabel=f'Embedding for "{word1}"',ylabel=f'Embedding for "{word2}"',
title=f'TWITTER (Cosine similarity: { np.round(twitter_model.similarity(word1,word2),3)})')
plt.tight_layout()
plt.show()

6. Semantic Drift: The "Battery" Test
This is where the corpus bias becomes obvious. In Wikipedia, "battery" is highly associated with technical terms like "lithium-ion" but also military terms like "weapon" and "gun" (as in an artillery battery). In Twitter, it's almost exclusively associated with consumer electronics.
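Under the hood, `most_similar` unit-normalizes all the vectors and ranks the vocabulary by cosine similarity to the query. A sketch of that ranking logic on a tiny made-up vocabulary (words and values are hypothetical, chosen only to illustrate the mechanics):

```python
import numpy as np

vocab = ['battery', 'charger', 'phone', 'cannon']
E = np.array([[ .8, .1, .5],    # battery  (toy values)
              [ .7, .2, .6],    # charger
              [ .6, .3, .4],    # phone
              [-.2, .9, .1]])   # cannon

def most_similar(word, topn=3):
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    q = En[vocab.index(word)]
    sims = En @ q                                      # cosine similarity to the query
    order = np.argsort(-sims)                          # descending similarity
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

for w, s in most_similar('battery'):
    print(f'  {w:10s} {s:.3f}')
```

With a real model the only differences are scale (hundreds of thousands of rows) and the precomputed normalized matrix that gensim caches.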
print('10 words most similar to "battery" in wiki:')
for word,sim in wiki_model.most_similar('battery',topn=10):
    print(f'  {word:15s}  {sim:.3f}')
print('\nAnd in twitter:')
for word,sim in twitter_model.most_similar('battery',topn=10):
    print(f'  {word:15s}  {sim:.3f}')

10 words most similar to "battery" in wiki:
  batteries        0.832
  rechargeable     0.726
  lithium-ion      0.708
  weapon           0.693

7. Analyzing a Sentence: Fox vs. Dog
Let's look at a classic sentence and compare each word's vocabulary index (roughly its frequency rank) in the two models.
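In gensim's GloVe models the vocabulary is ordered by corpus frequency, so `key_to_index[word]` is essentially the word's frequency rank. A toy illustration of how such a mapping is built from raw counts (the corpus here is just our example sentence, repeated a little):

```python
from collections import Counter

corpus = 'the quick brown fox jumps over the lazy dog the fox'.split()
counts = Counter(corpus)

# order words by descending frequency, as GloVe vocabularies are
index_to_key = [w for w, _ in counts.most_common()]
key_to_index = {w: i for i, w in enumerate(index_to_key)}

print(key_to_index['the'])   # the most frequent word gets index 0
```

A low index in the table below therefore means the word is common in that corpus; a large gap between the two columns means the word's relative frequency differs between Wikipedia and Twitter.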
text = 'The quick brown fox jumps over the lazy dog'
# lower-case to match GloVe's vocabulary, and keep only words present in both models
words = [w for w in text.lower().split()
         if w in wiki_model.key_to_index and w in twitter_model.key_to_index]

# index sequence in the two embeddings
wiki_idx = [wiki_model.key_to_index[w] for w in words]
twit_idx = [twitter_model.key_to_index[w] for w in words]

print(' Word | Wiki | Twitter')
print('-'*23)
for o,w,t in zip(words,wiki_idx,twit_idx):
    print(f'{o:>5} | {w:>5} | {t:>5}')

 Word | Wiki | Twitter
-----------------------
  The | 2582 | 2156
quick | 1042 | 1871
brown | 2106 | 4000

8. Final Comparison: Inter-word Similarities
Finally, we compare how word pairs relate to each other across the two models. The words are first filtered so that every pair exists in both vocabularies, avoiding a common indexing bug.
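The plot below colors each point by its distance from the diagonal y = x, computed in the loop by subtracting the point's projection onto the direction u = (1, 1). That residual is just |x − y|/√2, which a small sketch can confirm:

```python
import numpy as np

def dist_to_diagonal(x, y):
    """Perpendicular distance from the point (x, y) to the line y = x."""
    v = np.array([x, y])
    u = np.array([1, 1])
    proj = (v @ u) / (u @ u) * u      # projection of v onto the diagonal
    return np.linalg.norm(v - proj)

d = dist_to_diagonal(0.8, 0.2)
print(np.round(d, 4))                 # same as |0.8 - 0.2| / sqrt(2)
```

Points far from the diagonal (brighter colors) are word pairs whose similarity the two corpora disagree about most.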
# Filter words to ensure they exist in both models
valid_words = [w for w in words if w in wiki_model.key_to_index and w in twitter_model.key_to_index]
plt.figure(figsize=(9,7))
for i in range(len(valid_words)):
    for j in range(i+1, len(valid_words)):
        w1, w2 = valid_words[i], valid_words[j]
        if w1 == w2: continue  # skip repeated words (e.g., "the")
        cs_wiki = wiki_model.similarity(w1, w2)
        cs_twit = twitter_model.similarity(w1, w2)

        # distance from the point to the unity line (y = x)
        v = np.array([cs_wiki, cs_twit])
        u = np.array([1, 1])
        dist = np.linalg.norm(v - (sum(v*u)) / (np.linalg.norm(u)**2) * u)

        plt.plot(cs_wiki, cs_twit, 'ks', markersize=9, markerfacecolor=mpl.cm.plasma(dist*5))
        plt.text(cs_wiki, cs_twit+.02, f'{w1}-{w2}', va='bottom', ha='center', fontsize=8)
xylims = [.05, .95]
plt.plot(xylims, xylims, '--', color=[.4, .4, .4], zorder=-30)
plt.gca().set(xlim=xylims, ylim=xylims, xlabel='Wiki inter-word similarities',
ylabel='Twitter inter-word similarities', title='Inter-word similarities')
plt.show()