Introduction to Word Embeddings

In [10]:
import warnings
warnings.filterwarnings('ignore')
In [11]:
(Output of upgrading torchtext: torchtext 0.8.1 and torch 1.7.1, along with their dependencies, are already installed.)

A word embedding is a way to represent a word in a language as a vector of numbers. In this post, we'll examine word vectors created by GloVe (Global Vectors for Word Representation), an unsupervised learning algorithm. These vectors are learned from the word-word co-occurrence statistics that we saw in the previous post.

Visualize and explore the neighborhood of word vectors

The next cell takes some time (~25 minutes) to execute in Colab as it downloads GloVe and initializes the word vectors.

In [3]:
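A minimal sketch of this cell, assuming torchtext's GloVe class and the 840B/300d variant (consistent with the vocabulary size and dimensionality reported in the next cell):

from torchtext.vocab import GloVe

# Download the pre-trained GloVe vectors (assumed: the Common Crawl 840B-token,
# 300-dimensional variant) and build the word -> vector lookup.
glove = GloVe(name='840B', dim=300)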

Check that GloVe is downloaded and initialized correctly. There should be approximately 2.2M words, each with an associated 300-dimensional vector.

In [6]:
glove.vectors.shape
Out[6]:
torch.Size([2196017, 300])

Let's consider these 5 words.

  1. gurgaon: a city near the capital of India
  2. paratha: a flatbread popular in the Indian subcontinent
  3. dosa: a thin, pancake-like food popular in India, especially in the south
  4. pizza: a popular dish, originally from Italy
  5. pasta: a popular dish from Italy

The corresponding word vectors are shown below: blue indicates negative numbers and red indicates positive numbers. As you can see, pizza and pasta look very similar, and paratha and dosa appear similar to each other. Gurgaon, an Indian city, is not very similar to any of the food items; however, it seems closer to the foods from India than to the foods from Italy.

In [7]:
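A sketch of how the five vectors might be plotted; the colormap and value range are assumptions.

import matplotlib.pyplot as plt
import torch

words = ['gurgaon', 'paratha', 'dosa', 'pizza', 'pasta']
vecs = torch.stack([glove[w] for w in words])

# One row per word; blue = negative components, red = positive components.
plt.figure(figsize=(12, 3))
plt.imshow(vecs, cmap='bwr', vmin=-0.5, vmax=0.5)
plt.yticks(range(len(words)), words);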

Let's compute the cosine similarity, a measure of the angle between these vectors. Notice that the numbers agree with what we saw visually earlier.

  1. The Indian food items (paratha and dosa) are more similar to each other than to anything else.
  2. The Italian food items (pizza and pasta) are likewise most similar to each other. They're also more similar to the Indian food items than to the Indian city.
In [8]:
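A sketch of one way to compute the matrix (normalize each vector to unit length, then take pairwise dot products); it reuses vecs from the sketch above and stores the result in csim, the name used by the plotting cell further down.

import numpy as np

# Cosine similarity = dot product of unit-length vectors.
normed = vecs / vecs.norm(dim=1, keepdim=True)
csim = (normed @ normed.T).numpy()
np.round(csim, 3)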
Out[8]:
array([[1.   , 0.137, 0.14 , 0.018, 0.004],
       [0.137, 1.   , 0.691, 0.305, 0.371],
       [0.14 , 0.691, 1.   , 0.309, 0.38 ],
       [0.018, 0.305, 0.309, 1.   , 0.737],
       [0.004, 0.371, 0.38 , 0.737, 1.   ]], dtype=float32)

We can visualize the matrix above as follows, with yellow corresponding to higher similarity (i.e. smaller angles between vectors) and darker colors corresponding to lower similarity (i.e. larger angles).

In [ ]:
plt.imshow(csim)
plt.colorbar();

How far apart are foods and places?

Let's construct two lists, one with foods and another with place names. We'll then reduce the corresponding 300-dimensional vectors to two dimensions using an algorithm called UMAP, which preserves the global structure of the data to a certain extent in the 2D space.

As we see below, the vectors for the food words (blue) are clustered together and far apart from the place words (orange).

In [12]:
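A sketch of this cell; the particular food and place words and the UMAP settings are illustrative assumptions.

import umap  # from the umap-learn package

foods = ['paratha', 'dosa', 'biryani', 'samosa', 'pizza', 'pasta', 'lasagna']
places = ['gurgaon', 'delhi', 'mumbai', 'chennai', 'rome', 'milan', 'naples']

# Reduce the 300-dimensional vectors to 2D.
X = torch.stack([glove[w] for w in foods + places]).numpy()
emb = umap.UMAP(n_neighbors=5, random_state=42).fit_transform(X)

# Foods get the default first color (blue), places the second (orange).
n = len(foods)
plt.scatter(emb[:n, 0], emb[:n, 1], label='foods')
plt.scatter(emb[n:, 0], emb[n:, 1], label='places')
for (x, y), w in zip(emb, foods + places):
    plt.annotate(w, (x, y))
plt.legend();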

Let's also visualize the vectors of a random sample of 10000 words, reduced to 2D using UMAP.

In [13]:
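A sketch, assuming the sample is drawn uniformly at random from the GloVe vocabulary:

# Sample 10,000 word vectors at random and project them to 2D.
idx = torch.randperm(len(glove.itos))[:10000]
sample = glove.vectors[idx].numpy()
emb_sample = umap.UMAP(random_state=42).fit_transform(sample)

plt.scatter(emb_sample[:, 0], emb_sample[:, 1], s=1, alpha=0.3);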

We've seen above that word vectors (i.e. points in a 300-dimensional space) capture interesting semantic information. Let's now investigate whether directions in this space carry semantic information as well.

From feminine words to masculine words

Let's consider a sample of pairs of gender words:

  1. she, he
  2. her, his
  3. woman, man
  4. daughter, son
  5. mother, father

Let $f$ denote the vector of a feminine word and $m$ the vector of the corresponding masculine word. For a single pair, $f + (m - f)$ is trivially $m$. The interesting question is whether the difference $(m - f)$ points in roughly the same direction for every such pair: if it does, we can average it over a few pairs, add it to the vector of a new feminine word, and expect to land near the corresponding masculine word.

Let's first visualize the masculine and feminine word vectors. Notice that we can travel downwards from feminine words to get to masculine words.

In [14]:
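A rough sketch of this cell; the projection method (UMAP again) is an assumption, so the orientation of the resulting plot may differ.

pairs = [('she', 'he'), ('her', 'his'), ('woman', 'man'),
         ('daughter', 'son'), ('mother', 'father')]
gender_words = [w for pair in pairs for w in pair]

# Project the ten gender-word vectors to 2D and label each point.
X = torch.stack([glove[w] for w in gender_words]).numpy()
emb_g = umap.UMAP(n_neighbors=5, random_state=42).fit_transform(X)

plt.scatter(emb_g[:, 0], emb_g[:, 1])
for (x, y), w in zip(emb_g, gender_words):
    plt.annotate(w, (x, y))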

Let's construct an "average" opposite gender vector which we can add to feminine words in order to get to the corresponding masculine word.

In [17]:
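A sketch of one way to build this: average the (masculine - feminine) differences over the pairs above, add the result to a query word's vector, and return the most similar word in the vocabulary. The name get_masculine_word matches the calls below, but the body here is an assumption and may not reproduce every output exactly.

# Average difference between the masculine and feminine vectors of each pair.
diffs = torch.stack([glove[m] - glove[f] for f, m in pairs])
to_masculine = diffs.mean(dim=0)

def get_masculine_word(feminine_word):
    # Shift the query word's vector by the average "opposite gender" vector,
    # then return the most cosine-similar word (excluding the query itself).
    query = glove[feminine_word] + to_masculine
    sims = (glove.vectors @ query) / (glove.vectors.norm(dim=1) * query.norm() + 1e-9)
    for idx in sims.argsort(descending=True):
        candidate = glove.itos[int(idx)]
        if candidate != feminine_word:
            return candidate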

Let's test our opposite gender vector on new feminine words to see whether we can get the corresponding masculine words.

In [18]:
get_masculine_word('headmistress')
Out[18]:
'headmaster'
In [19]:
get_masculine_word('daughter-in-law')
Out[19]:
'son-in-law'
In [20]:
get_masculine_word('step-mother')
Out[20]:
'step-father'
In [21]:
get_masculine_word('mother-in-law')
Out[21]:
'father-in-law'
In [22]:
get_masculine_word('step-daughter')
Out[22]:
'step-son'
In [23]:
get_masculine_word('stepdaughter')
Out[23]:
'stepson'
In [24]:
get_masculine_word('policewoman')
Out[24]:
'policeman'
In [25]:
get_masculine_word('chairwoman')
Out[25]:
'chairman'

Play with some words and see how well the opposite gender vector works! In addition to points in the 300-dimensional space encoding semantic information, the directions between vectors carry information as well!