
GloVe Embeddings 🧤
Global Vectors for Word Representation (GloVe) combines the best of global statistics and local context to create powerful, semantically rich word embeddings.
GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations for words. It differs from methods like Word2Vec by explicitly leveraging the global co-occurrence statistics of a corpus, rather than just local context windows.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
GloVe builds on the insight that ratios of word-word co-occurrence probabilities carry significant semantic information.
- Global Co-occurrence: It builds a massive matrix of how often every word appears near every other word in the entire dataset (e.g., Wikipedia).
- Matrix Factorization: It factorizes this matrix into a lower-dimensional space (e.g., 50 or 300 dimensions per word).
- Linear Relationships: The resulting vectors maintain powerful linear relationships, allowing for famous analogies like King - Man + Woman = Queen.
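The analogy arithmetic is nothing more than vector addition followed by a nearest-neighbor search under cosine similarity. A minimal sketch with toy, hand-picked 3-D vectors (purely illustrative — these are not real GloVe values, which are 50+ dimensional) shows the mechanics:

```python
import numpy as np

# toy 3-D "embeddings", hand-picked for illustration only
vocab = {
    'king':  np.array([0.9, 0.8, 0.1]),   # royal + male
    'queen': np.array([0.9, 0.1, 0.8]),   # royal + female
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen
target = vocab['king'] - vocab['man'] + vocab['woman']

# nearest neighbor by cosine similarity, excluding the query word itself
# (gensim's most_similar excludes the input words the same way)
best = max((w for w in vocab if w != 'king'),
           key=lambda w: cosine(vocab[w], target))
print(best)  # -> queen
```

With the real model loaded, `glove.most_similar(positive=['king', 'woman'], negative=['man'])` runs the same query against all 400,000 vectors.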
1. Loading a GloVe Model
We'll use a pre-trained small GloVe model (50 dimensions) trained on Wikipedia and Gigaword data. This model is lightweight and perfect for exploration.
```python
# download a small GloVe model (Wikipedia + Gigaword, 50D)
# NOTE: if the import fails, run the !pip line below, restart your
# session (from the Runtime menu), then comment the pip line out again.
# !pip install gensim
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-50')
```

2. Environment Setup
We'll use standard data science tools: numpy for matrix operations, scipy for statistics, and matplotlib/seaborn for visualization.
```python
import numpy as np
import scipy
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# uncomment for svg plots
# import matplotlib_inline.backend_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
```

3. Inspecting the Model Properties
Let's see what's inside the glove object provided by the gensim library.
```python
# check the properties and methods
dir(glove)
```

```
['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 ...]
```

4. Exploring the Vocabulary
Our model contains 400,000 unique tokens. The most frequent tokens are common English words and punctuation.
```python
print(f'The dictionary contains {len(glove.key_to_index.keys())} items.')
list(glove.key_to_index.keys())[:50]
```

```
The dictionary contains 400000 items.
['the',
 ',',
 '.',
 'of',
 'to',
 ...]
```

5. Random Word Samples
To get a sense of the diversity, let's pull 10 random words from the vocabulary.
```python
# print 10 words at random
for idx in np.random.randint(0, len(glove.key_to_index), 10):
    print(f'Index {idx:>6} is "{glove.index_to_key[idx]}"')
```

```
Index 277473 is "posillipo"
Index 174657 is "fujikura"
Index 377518 is "ex-guitarist"
Index  41190 is "u-boats"
Index 140299 is "jota"
...
```

6. Distribution of Word Lengths
Most words in English are relatively short. We can visualize the distribution of character lengths across the entire 400k vocabulary.
```python
# distribution of token character lengths
token_lengths = np.zeros(len(glove.key_to_index.keys()), dtype=int)
for idx, word in enumerate(glove.key_to_index.keys()):
    token_lengths[idx] = len(word)

# counts for the bar plot
uniqVals, uniqCounts = np.unique(token_lengths, return_counts=True)

# visualize the distribution of lengths (log scale, since short words dominate)
plt.figure(figsize=(12,4))
plt.bar(uniqVals, np.log(uniqCounts), width=uniqVals[1]-uniqVals[0],
        facecolor=[.9,.7,.9], edgecolor='k')
plt.gca().set(xlabel='Word length (num characters)', ylabel='Log count')
plt.show()
```

7. The Embeddings Matrix
The core of the model is a massive $400,000 \times 50$ matrix. Every word is represented by a single row (vector) in this matrix.
```python
# size of the embeddings matrix
print(f'The embeddings matrix is {glove.vectors.shape}')
print(f'The word "apple" has index #{glove.key_to_index["apple"]}')

# can also access it this way:
glove.get_index('apple')
```

```
The embeddings matrix is (400000, 50)
The word "apple" has index #3292
3292
```

8. Visualizing the Matrix
Visualizing the transposed matrix gives us a global view of the vector values across all dimensions and indices.
```python
plt.figure(figsize=(12,4))
plt.imshow(glove.vectors.T, vmin=-1, vmax=1, aspect='auto')
plt.gca().set(ylabel='Dimension', xlabel='Word index', title='Embeddings matrix')
plt.colorbar(pad=.01)
plt.show()
```

9. Statistical Distribution
We can use a joint plot to see the relationship between the mean and standard deviation of values across the embedding dimensions.
```python
# mean and std of each word's vector (i.e., across its 50 dimensions)
emb_mean = glove.vectors.mean(axis=1)
emb_std  = glove.vectors.std(axis=1)

# seaborn has nice visualization routines
import seaborn as sns
import pandas as pd  # seaborn's jointplot wants a pandas DataFrame
df = pd.DataFrame(np.vstack((emb_mean, emb_std)).T, columns=['Mean','std'])

sns.jointplot(x='Mean', y='std', data=df, alpha=.2)
plt.show()
```

10. Individual Word Vectors
Let's zoom in on a single word, "banana", and look at its 50-dimensional vector representation.
```python
# pick a word
word = 'banana'

# get its index in the embeddings matrix
wordidx = glove.key_to_index[word]

# get the embedding vector
thisWordVector = glove.vectors[wordidx,:]

# inspect the vector
print(f'The embedding vector for "{word}" is\n {thisWordVector}')
```

```
The embedding vector for "banana" is
 [-0.25522  -0.75249  -0.86655   1.1197    0.12887   1.0121   -0.57249
  -0.36224   0.44341  -0.12211   0.073524  0.21387   0.96744  -0.068611
   0.51452  -0.053425 -0.21966   0.23012   1.043    -0.77016  -0.16753
  -1.0952    0.24837   0.20019  -0.40866  -0.48037   0.10674   0.5316
  ...]
```

11. Plotting a Word Profile
By plotting the vector values, we can see the "fingerprint" of a word in the embedding space.
```python
# visualize it
plt.figure(figsize=(10,4))
plt.plot(glove.vectors[wordidx,:], 'ks', markersize=10, markerfacecolor=[.7,.7,.9])
plt.xlabel('Dimension')
plt.title(f'Embedding vector for "{word}"')
plt.show()
```

12. Measuring Semantic Similarity
The power of embeddings lies in their geometry. Words with similar meanings (like "banana" and "apple") will have similar vector profiles and a high Cosine Similarity score compared to unrelated words (like "cosmic").
```python
# pick three words
word1 = 'banana'
word2 = 'apple'
word3 = 'cosmic'

# set up the figure subplot geometry
fig = plt.figure(figsize=(10,7))
gs = GridSpec(2,2)
ax0 = fig.add_subplot(gs[0,:])
ax1 = fig.add_subplot(gs[1,0])
ax2 = fig.add_subplot(gs[1,1])

# plot the embeddings by dimension
for word in (word1, word2, word3):
    ax0.plot(glove[word], 's-', label=word)
ax0.set(xlabel='Dimension', title='Embeddings', xlim=[-1, glove.vectors.shape[1]+1])
ax0.legend()

# scatter the embeddings against each other
cossim = glove.similarity(word1, word2)
ax1.plot(glove[word1], glove[word2], 'ko', markerfacecolor=[.9,.7,.7])
ax1.set(xlabel=word1, ylabel=word2, title=f'Cosine similarity = {cossim:.3f}')

cossim = glove.similarity(word1, word3)
ax2.plot(glove[word1], glove[word3], 'ko', markerfacecolor=[.7,.9,.7])
ax2.set(xlabel=word1, ylabel=word3, title=f'Cosine similarity = {cossim:.3f}')

# final touches
plt.tight_layout()
plt.show()
```

13. Word Analogies and Anomalies
GloVe allows us to perform sophisticated semantic queries, such as finding the most similar words or identifying which word doesn't fit in a list.
```python
# most similar words ("similar" means high cosine similarity)
glove.most_similar('fashion', topn=9)
```

```
[('style', 0.760734498500824),
 ('fashions', 0.7528777122497559),
 ('designer', 0.7515820860862732),
 ('chic', 0.7511471509933472),
 ('designers', 0.7450659275054932),
 ...]
```

```python
# One of these things is not like the others...
lists = [ ['apple','banana','pirate','peach'],
          ['apple','banana','peach','kiwi','starfruit'],
          ['apple','banana','pirate','peach','kiwi','starfruit'],
          ['apple','banana','orange','kiwi']
        ]

for l in lists:
    print(f'In the word list {l}:')
    print(f'  The most similar word is "{glove.most_similar(l,topn=1)[0][0]}"')
    print(f'  and the non-matching word is "{glove.doesnt_match(l)}"\n')
```

```
In the word list ['apple', 'banana', 'pirate', 'peach']:
  The most similar word is "mango"
  and the non-matching word is "pirate"

In the word list ['apple', 'banana', 'peach', 'kiwi', 'starfruit']:
...
```
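Under the hood, `similarity`, `most_similar`, and `doesnt_match` all rest on the same cosine similarity between rows of the embeddings matrix. A minimal numpy version of the formula, using toy vectors so it runs standalone (with the model loaded, you could pass `glove['banana']` and `glove['apple']` instead):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a,b) / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors standing in for word embeddings
v1 = np.array([1.0,  2.0, 3.0])
v2 = np.array([2.0,  4.0, 6.0])   # same direction as v1 -> similarity ~ 1
v3 = np.array([-3.0, 0.0, 1.0])   # orthogonal to v1     -> similarity ~ 0

print(cosine_similarity(v1, v2))  # ~ 1.0
print(cosine_similarity(v1, v3))  # ~ 0.0
```

Because cosine similarity ignores vector magnitude and measures only direction, two words with very different overall activation levels can still score as highly similar — which is exactly the behavior we saw for "banana" and "apple" above.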