In [10]:

import warnings
warnings.filterwarnings('ignore')

In [11]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
!pip install -U torchtext
import torch
import torchtext
from torchtext.vocab import GloVe
import umap
pd.options.display.max_colwidth=-1
#

Requirement already up-to-date: torchtext in /usr/local/lib/python3.6/dist-packages (0.8.1)
Requirement already satisfied, skipping upgrade: requests in /usr/local/lib/python3.6/dist-packages (from torchtext) (2.23.0)
Requirement already satisfied, skipping upgrade: numpy in /usr/local/lib/python3.6/dist-packages (from torchtext) (1.19.5)
Requirement already satisfied, skipping upgrade: torch==1.7.1 in /usr/local/lib/python3.6/dist-packages (from torchtext) (1.7.1)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.6/dist-packages (from torchtext) (4.41.1)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->torchtext) (1.24.3)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->torchtext) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->torchtext) (2.10)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->torchtext) (2020.12.5)
Requirement already satisfied, skipping upgrade: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from torch==1.7.1->torchtext) (0.8)
Requirement already satisfied, skipping upgrade: typing-extensions in /usr/local/lib/python3.6/dist-packages (from torch==1.7.1->torchtext) (3.7.4.3)

A word embedding is a way to represent a word in a language as a vector of numbers. In this post, we'll examine word vectors created by GloVe, global vectors for word representation , an unsupervised learning algorithm. These vectors are learned from word x word co-occurrence statistics that we've seen in the previous post.

Visualize and explore the neighborhood of word vectors ¶

The next cell takes some time (~25 minutes) to execute in Colab as it downloads GloVe and initializes the word vectors.

In [3]:

glove = GloVe(name = '840B', dim = 300)

def get_word_vec(word):
  assert word in glove.stoi, f"{word} is OOV!"
  vec = glove.vectors[glove.stoi[word]]
  return vec


def cosine_similarity(word1, word2):
  vec1 = get_word_vec(word1)
  vec2 = get_word_vec(word2)
  return torch.cosine_similarity(vec1.unsqueeze(0),
                                 vec2.unsqueeze(0)).item()


def get_nearest_neighbours(word_or_vec, n=10):
    vec = None
    if isinstance(word_or_vec, str):
        vec = get_word_vec(word_or_vec)
    else:
        vec = word_or_vec
        
    topn =  torch.argsort(torch.mv(glove.vectors, vec), descending=True)[:n]
        
    return [glove.itos[idx] for idx in topn]

def plot_embeddings(words):
  fig, axes = plt.subplots(nrows=len(words), 
                           ncols=1, 
                           sharex=True, 
                           figsize=(20,len(words)))

  for ax, w in zip(axes.flat, words):
    im = ax.imshow(glove.vectors[glove.stoi[w]].view(1,-1), 
                   cmap=plt.cm.coolwarm)
    ax.set_aspect('auto')

  fig.subplots_adjust(right=0.8)
  cbar_ax = fig.add_axes([0.85, 0.15, 0.01, 0.7])
  fig.colorbar(im, cax=cbar_ax)
#

Check that GloVe is downloaded and initalized correctly. There should be approximately 2.2 M words with a corresponding vector with 300 numbers, i.e. each word has a 300 dimensional vector associated with it. ¶

In [6]:

glove.vectors.shape

Out[6]:

torch.Size([2196017, 300])

Let's consider these 5 words.

gurgaon: a city near the capital of India
paratha: a flatbread popular in the Indian subcontinent
dosa: a thin pancake like food popular in India, especially in the south
pizza: a popular dish, originally from Italy
pasta: a popular dish from Italy

The corresponding word vectors are shown below: blue indicates negative numbers and red indicates positive numbers. As you can see, pizza and pasta look very similar. Paratha and dosa appear similar to each other. Gurgaon, an Indian city, is not very similar to any of the food items. However, it seems closer to the foods from India than from Italy.

In [7]:

words = ['gurgaon','paratha','dosa', 'pizza', 'pasta']
plot_embeddings(words)
#

Let's compute the cosine similarity, a measure of the angle between these vectors. Notice that the numbers agree with what we saw visually earlier.

The Indian food items are more similar to each other
The Italian food items are more similar to each other. They're also more similar to the Indian food items than to the Indian city.

In [8]:

E = torch.stack([get_word_vec(w) for w in words], dim=0)

from sklearn.metrics.pairwise import cosine_similarity as mat_cosine_similarity

csim = mat_cosine_similarity(E)
csim.round(3)
#

Out[8]:

array([[1.   , 0.137, 0.14 , 0.018, 0.004],
       [0.137, 1.   , 0.691, 0.305, 0.371],
       [0.14 , 0.691, 1.   , 0.309, 0.38 ],
       [0.018, 0.305, 0.309, 1.   , 0.737],
       [0.004, 0.371, 0.38 , 0.737, 1.   ]], dtype=float32)

We can visualize the matrix above as follows, with yellow corresponding to higher similarity (i.e. smaller angles between vectors) and darker colors corresponding to lower similarity (i.e. larger angles).

In [ ]:

plt.imshow(csim)
plt.colorbar();

How far apart are foods and places? ¶

Let's construct two lists, one with foods and another with place names. We'll then reduce the corresponding 300 dimensional vectors to a two dimensional space using an algorithm called UMAP , which preserves the global structure of the data to a certain extent in the 2D space.

As we see below, the vectors for the food words (blue) are clustered together and far apart from the place words (orange).

In [12]:

foods = ['chapati','chana','bhatura','chole','dal',
         'poha','idli','upma','dosa','puri',
         'paratha','jalebi','kachori','khichdi','kofta']

places = ['gurgaon','faridabad','lucknow','noida','delhi',
          'koramangala','bangalore','goa','panaji','coorg',
          'ladakh','amritsar','hyderabad','mumbai','varanasi'
         ]

food_places_df = pd.DataFrame(data={'word' : foods + places, 
                                    'type' : ['food'] * len(foods) + 
                                             ['place'] * len(places)})
E = torch.stack([get_word_vec(row.word) for _,row in food_places_df.iterrows()], 
                 dim=0)

fit = umap.UMAP(n_neighbors=5, random_state=42, metric='cosine')
u = fit.fit_transform(E)

fig, ax = plt.subplots(figsize=(12,4))
sns.scatterplot(u[:,0], u[:,1], ax=ax, hue=food_places_df.type, s=100);
#

Let's also visualize the vectors of a random sample of 10000 words, reduced to 2D using UMAP.

In [13]:

num_points = 10000
indices = np.random.choice(glove.vectors.shape[0],size=num_points, 
                           replace=False)

E = glove.vectors[indices]
fit = umap.UMAP(n_neighbors=10, random_state=42, metric='cosine')
u = fit.fit_transform(E)
fig, ax = plt.subplots(figsize=(18,8))
sns.scatterplot(u[:,0], u[:,1], ax=ax, s=10);
#

We've seen above that word vectors (i.e. points in a 300 dimensional space) capture interesting semantic information. Let's now investigate whether directions in this space carry semantic information as well.

From feminine words to masculine words ¶

Let's consider a sample of pairs of gender words

she, he
her, his
woman, man
daughter, son
mother, father

Let $f$ denote the vector corresponding to the feminine gender and $m$ denote the vector corresponding to the masculine gender. What happens if we do $f + (m - f)$? Will we get back the vector $m$?

Let's first visualize the masculine and femine word vectors. Notice that we can travel downwards from feminine words to get to masculine words.

In [14]:

gender_pairs = [('she','he'),
                ('her','his'),
                ('woman','man'),
                ('daughter','son'),
                ('mother','father'),
               ]
words = [el for pair in gender_pairs for el in pair]
gender_df = pd.DataFrame(data={'gender' : ['female', 'male'] * len(gender_pairs),
                               'word' : words
                              })
X = torch.stack([get_word_vec(w) for w in gender_df.word], dim=0)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([('scaler',StandardScaler()),
                 ('pca', PCA(n_components=2))
                ])

X_reduced = pipe.fit_transform(X)

fig, ax = plt.subplots(figsize=(12,4))

sns.scatterplot(X_reduced[:,0], X_reduced[:,1], ax=ax, hue=gender_df.gender, s=60)

for i, word in enumerate(words):
    ax.annotate(word, (X_reduced[i,0], X_reduced[i,1]))
    
for i in range(0, X_reduced.shape[0], 2):
    ax.plot(X_reduced[i:i+2,0], X_reduced[i:i+2,1], c='gray', linewidth=1)
#

Let's construct an "average" opposite gender vector which we can add to feminine words in order to get to the corresponding masculine word.

In [17]:

vecs = []
for i in range(0, X.shape[0], 2):
    a = X[i,:]
    b = X[i+1,:]
    mean_vec = (a+b)/2
    vecs.append(a-mean_vec)
    vecs.append(b-mean_vec)
X1 = torch.stack(vecs, dim=0)

vecs = []
for i in range(0, X1.shape[0], 2):
    a = X[i,:]
    b = X[i+1,:]
    vecs.append(b-a)
X2 = torch.stack(vecs, dim=0)

to_masculine_vector = X2.mean(dim=0)

def get_masculine_word(feminine_word):
  word_vector = get_word_vec(feminine_word)
  candidates = get_nearest_neighbours(word_vector + to_masculine_vector)
  return [c for c in candidates if c not in [feminine_word, feminine_word+"s"]][0]
#

Let's test our opposite gender vector on new feminine words to see whether we can get the corresponding masculine word

In [18]:

get_masculine_word('headmistress')

Out[18]:

'headmaster'

In [19]:

get_masculine_word('daughter-in-law')

Out[19]:

'son-in-law'

In [20]:

get_masculine_word('step-mother')

Out[20]:

'step-father'

In [21]:

get_masculine_word('mother-in-law')

Out[21]:

'father-in-law'

In [22]:

get_masculine_word('step-daughter')

Out[22]:

'step-son'

In [23]:

get_masculine_word('stepdaughter')

Out[23]:

'stepson'

In [24]:

get_masculine_word('policewoman')

Out[24]:

'policeman'

In [25]:

get_masculine_word('chairwoman')

Out[25]:

'chairman'

Play with some words and see how good is the opposite gender vector! In addition to the points in the 300-dimensional space that encode semantic information, the directions between vectors also carry information!

Introduction to Word Embeddings

Visualize and explore the neighborhood of word vectors ¶

Check that GloVe is downloaded and initalized correctly. There should be approximately 2.2 M words with a corresponding vector with 300 numbers, i.e. each word has a 300 dimensional vector associated with it. ¶

How far apart are foods and places? ¶

From feminine words to masculine words ¶