Introduction to Word Embeddings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
!pip install -U torchtext
import torch
import torchtext
from torchtext.vocab import GloVe
import umap
pd.options.display.max_colwidth = None  # -1 is deprecated in newer pandas; None means "no limit"
A word embedding is a way to represent a word in a language as a vector of numbers. In this post, we'll examine word vectors created by GloVe (Global Vectors for Word Representation), an unsupervised learning algorithm. These vectors are learned from the word-word co-occurrence statistics that we saw in the previous post.
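Concretely, an embedding maps every word in the vocabulary to a point in a $d$-dimensional space; for the vectors used here, $d = 300$. Purely as an illustration (the numbers below are made up),

$$\text{pizza} \;\mapsto\; (0.12,\; -0.48,\; \ldots,\; 0.07) \in \mathbb{R}^{300}.$$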
Visualize and explore the neighborhood of word vectors
The next cell takes some time (~25 minutes) to execute in Colab as it downloads GloVe and initializes the word vectors.
glove = GloVe(name='840B', dim=300)  # the 840B/300d vectors; the download is large but gets cached locally for later runs
def get_word_vec(word):
    # Look up the GloVe vector for a word; fail loudly if it's out of vocabulary.
    assert word in glove.stoi, f"{word} is OOV!"
    vec = glove.vectors[glove.stoi[word]]
    return vec
def cosine_similarity(word1, word2):
    # Cosine of the angle between the two words' vectors.
    vec1 = get_word_vec(word1)
    vec2 = get_word_vec(word2)
    return torch.cosine_similarity(vec1.unsqueeze(0),
                                   vec2.unsqueeze(0)).item()
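As a quick sanity check (these calls are an addition, not part of the original notebook), related words should score noticeably higher than unrelated ones:
# Expect a fairly high similarity for the two Italian dishes,
# and a lower one for a dish paired with a city.
cosine_similarity('pizza', 'pasta'), cosine_similarity('pizza', 'gurgaon')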
def get_nearest_neighbours(word_or_vec, n=10):
    # Rank the whole vocabulary by dot product with the query vector
    # and return the n highest-scoring words.
    if isinstance(word_or_vec, str):
        vec = get_word_vec(word_or_vec)
    else:
        vec = word_or_vec
    topn = torch.argsort(torch.mv(glove.vectors, vec), descending=True)[:n]
    return [glove.itos[idx] for idx in topn]
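Similarly, we can peek at a word's neighbourhood (again an added illustration). Since the ranking uses the raw dot product, the result typically includes the query word itself, followed by semantically related words:
# The neighbourhood of a common word should be dominated by related terms.
get_nearest_neighbours('pizza')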
def plot_embeddings(words):
    # One row per word: each 300-d vector is drawn as a 1 x 300 heat-map strip.
    fig, axes = plt.subplots(nrows=len(words),
                             ncols=1,
                             sharex=True,
                             figsize=(20, len(words)))
    for ax, w in zip(axes.flat, words):
        im = ax.imshow(glove.vectors[glove.stoi[w]].view(1, -1),
                       cmap=plt.cm.coolwarm)
        ax.set_aspect('auto')
    # Add a single shared colorbar on the right of the figure.
    fig.subplots_adjust(right=0.8)
    cbar_ax = fig.add_axes([0.85, 0.15, 0.01, 0.7])
    fig.colorbar(im, cax=cbar_ax)
Check that GloVe is downloaded and initialized correctly. There should be approximately 2.2M words, each with a corresponding 300-dimensional vector.
glove.vectors.shape
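As a small added check, we can also assert the expected dimensionality explicitly:
# Expect roughly 2.2 million rows (words), each a 300-dimensional vector.
n_words, dim = glove.vectors.shape
assert dim == 300
print(f"{n_words:,} words x {dim} dimensions")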
Let's consider these 5 words.
- gurgaon: a city near the capital of India
- paratha: a flatbread popular in the Indian subcontinent
- dosa: a thin, pancake-like food popular in India, especially in the south
- pizza: a popular dish, originally from Italy
- pasta: a popular dish from Italy
The corresponding word vectors are shown below: blue indicates negative numbers and red indicates positive numbers. As you can see, pizza and pasta look very similar, and paratha and dosa appear similar to each other. Gurgaon, an Indian city, is not very similar to any of the food items, though it appears closer to the Indian foods than to the Italian ones.
words = ['gurgaon','paratha','dosa', 'pizza', 'pasta']
plot_embeddings(words)
Let's compute the cosine similarity, a measure of the angle between these vectors (see the formula below). Notice that the numbers agree with what we saw visually earlier.
- The Indian food items are more similar to each other
- The Italian food items are more similar to each other. They're also more similar to the Indian food items than to the Indian city.
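For reference, the cosine similarity of two word vectors $u$ and $v$ is

$$\text{cos\_sim}(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \;=\; \cos\theta,$$

where $\theta$ is the angle between them: values near $1$ mean the vectors point in nearly the same direction, $0$ means they are orthogonal, and $-1$ means they point in opposite directions.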
E = torch.stack([get_word_vec(w) for w in words], dim=0)
from sklearn.metrics.pairwise import cosine_similarity as mat_cosine_similarity
csim = mat_cosine_similarity(E)
csim.round(3)
We can visualize the matrix above as follows, with yellow corresponding to higher similarity (i.e. smaller angles between vectors) and darker colors corresponding to lower similarity (i.e. larger angles).
plt.imshow(csim)
plt.colorbar();
How far apart are foods and places?
Let's construct two lists, one with foods and another with place names. We'll then reduce the corresponding 300-dimensional vectors to a two-dimensional space using an algorithm called UMAP, which preserves the global structure of the data to some extent in the 2D space.
As we see below, the vectors for the food words (blue) cluster together, far away from the place words (orange).
foods = ['chapati','chana','bhatura','chole','dal',
'poha','idli','upma','dosa','puri',
'paratha','jalebi','kachori','khichdi','kofta']
places = ['gurgaon','faridabad','lucknow','noida','delhi',
'koramangala','bangalore','goa','panaji','coorg',
'ladakh','amritsar','hyderabad','mumbai','varanasi'
]
food_places_df = pd.DataFrame(data={'word' : foods + places,
'type' : ['food'] * len(foods) +
['place'] * len(places)})
E = torch.stack([get_word_vec(row.word) for _,row in food_places_df.iterrows()],
dim=0)
fit = umap.UMAP(n_neighbors=5, random_state=42, metric='cosine')
u = fit.fit_transform(E)
fig, ax = plt.subplots(figsize=(12,4))
sns.scatterplot(x=u[:,0], y=u[:,1], ax=ax, hue=food_places_df.type, s=100);
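If you'd like to see which point is which, the same axes can be annotated with the words (an optional addition, not part of the original plot); in a notebook, putting `fig` as the last expression re-displays the annotated figure:
# Label each 2-D point with its word so the two clusters are easier to read.
for (x, y), word in zip(u, food_places_df.word):
    ax.annotate(word, (x, y), fontsize=9)
fig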
Let's also visualize the vectors of a random sample of 10000 words, reduced to 2D using UMAP.
num_points = 10000
indices = np.random.choice(glove.vectors.shape[0],size=num_points,
replace=False)
E = glove.vectors[indices]
fit = umap.UMAP(n_neighbors=10, random_state=42, metric='cosine')
u = fit.fit_transform(E)
fig, ax = plt.subplots(figsize=(18,8))
sns.scatterplot(x=u[:,0], y=u[:,1], ax=ax, s=10);
We've seen above that word vectors (i.e. points in a 300-dimensional space) capture interesting semantic information. Let's now investigate whether directions in this space carry semantic information as well.
From feminine words to masculine words
Let's consider a sample of gendered word pairs:
- she, he
- her, his
- woman, man
- daughter, son
- mother, father
Let $f$ denote the vector of a feminine word and $m$ the vector of the corresponding masculine word. For a single pair, $f + (m - f)$ is trivially $m$. The interesting question is: if we estimate an average "feminine to masculine" direction from several pairs and add it to the vector of a new feminine word, do we land near the corresponding masculine word?
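Concretely, this is what the code below computes: given pairs $(f_i, m_i)$, the average direction is

$$\vec{d}_{f \to m} \;=\; \frac{1}{N}\sum_{i=1}^{N} \left(m_i - f_i\right),$$

and for a new feminine word with vector $f_{\text{new}}$ we look at the nearest neighbours of $f_{\text{new}} + \vec{d}_{f \to m}$.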
Let's first visualize the masculine and feminine word vectors. Notice that we can travel roughly downwards from the feminine words to get to the masculine ones.
gender_pairs = [('she','he'),
('her','his'),
('woman','man'),
('daughter','son'),
('mother','father'),
]
words = [el for pair in gender_pairs for el in pair]
gender_df = pd.DataFrame(data={'gender' : ['female', 'male'] * len(gender_pairs),
'word' : words
})
X = torch.stack([get_word_vec(w) for w in gender_df.word], dim=0)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler',StandardScaler()),
('pca', PCA(n_components=2))
])
X_reduced = pipe.fit_transform(X)
fig, ax = plt.subplots(figsize=(12,4))
sns.scatterplot(x=X_reduced[:,0], y=X_reduced[:,1], ax=ax, hue=gender_df.gender, s=60)
for i, word in enumerate(words):
ax.annotate(word, (X_reduced[i,0], X_reduced[i,1]))
for i in range(0, X_reduced.shape[0], 2):
ax.plot(X_reduced[i:i+2,0], X_reduced[i:i+2,1], c='gray', linewidth=1)
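As an optional check that I'm adding here (not in the original post), you can see how much of the variance the two principal components retain:
# Fraction of the (standardized) variance captured by each of the two components.
print(pipe.named_steps['pca'].explained_variance_ratio_)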
Let's construct an "average" opposite-gender vector, which we can add to a feminine word's vector in order to get to the corresponding masculine word.
# Average "feminine -> masculine" direction: the mean of (masculine - feminine)
# over all pairs. X holds the vectors in (feminine, masculine) order.
diffs = []
for i in range(0, X.shape[0], 2):
    f = X[i, :]      # feminine word vector
    m = X[i + 1, :]  # masculine word vector
    diffs.append(m - f)
to_masculine_vector = torch.stack(diffs, dim=0).mean(dim=0)
def get_masculine_word(feminine_word):
    # Shift the feminine word's vector along the average direction and look at
    # its nearest neighbours, skipping the query word itself and its plural.
    word_vector = get_word_vec(feminine_word)
    candidates = get_nearest_neighbours(word_vector + to_masculine_vector)
    return [c for c in candidates if c not in [feminine_word, feminine_word + "s"]][0]
Let's test our opposite-gender vector on new feminine words to see whether we get the corresponding masculine words.
get_masculine_word('headmistress')
get_masculine_word('daughter-in-law')
get_masculine_word('step-mother')
get_masculine_word('mother-in-law')
get_masculine_word('step-daughter')
get_masculine_word('stepdaughter')
get_masculine_word('policewoman')
get_masculine_word('chairwoman')
Play with some words and see how good the opposite-gender vector is! In addition to the points in the 300-dimensional space encoding semantic information, the directions between vectors carry meaning as well!
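As one last experiment, here's a sketch (my addition, with a hypothetical helper name) that walks the averaged direction the other way, from masculine words towards feminine ones; it should work roughly symmetrically:
# Hypothetical reverse helper: subtract the averaged direction instead of adding it.
def get_feminine_word(masculine_word):
    word_vector = get_word_vec(masculine_word)
    candidates = get_nearest_neighbours(word_vector - to_masculine_vector)
    return [c for c in candidates if c not in [masculine_word, masculine_word + "s"]][0]

get_feminine_word('headmaster')  # hopefully the feminine counterpart, e.g. 'headmistress'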