Introduction to Natural Language Processing
In this session, we'll learn about the basic concepts of Natural Language Processing (NLP) using a restaurant reviews dataset from Yelp.
- Distributional Hypothesis: a word is characterized by the company it keeps.
- Vector Semantics: how to implement the above idea so that a computer can learn the meanings of words directly from text.
- Co-occurrence Matrix: a data structure holding counts of how often words co-occur in a given text (see the toy sketch right after this list).
- Transforming Raw Counts: methods to transform the raw counts in the above matrix to extract useful information.
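Before working with the Yelp data, here is a minimal sketch of the co-occurrence idea using a couple of made-up sentences (the sentences and window size are purely illustrative): we count how often each word appears within a small window around every other word.
# Toy example (made-up sentences, not the Yelp data): count how often each
# word co-occurs with context words within a +/- 2 token window.
from collections import Counter, defaultdict

toy_sentences = [
    "the curry was spicy and delicious",
    "the tacos were spicy and cheap",
]
window = 2
counts = defaultdict(Counter)
for sent in toy_sentences:
    tokens = sent.split()
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for ctx in context:
            counts[word][ctx] += 1

# 'spicy' co-occurs with 'curry' and 'tacos', among others
print(counts["spicy"])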
Note: if you're running this in Google Colab, please save the notebook to your drive first before executing it.
!wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz
!tar xvfz yelp_review_full_csv.tgz
#
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds
import pandas as pd
import seaborn as sns
import spacy
nlp = spacy.load('en_core_web_sm')  # small English model; install with: python -m spacy download en_core_web_sm
pd.options.display.max_colwidth = None  # show full review text in DataFrame output
#
Browse Restaurant Reviews from Yelp
DATA_DIR = "./yelp_review_full_csv/"
df = pd.read_csv(DATA_DIR + "train.csv", header=None, names=['rating', 'review'])
df[df.review.str.contains("indian")].sample(3)
#
Extract cuisines using Named Entity Recognition (NER)
The restaurants being reviewed serve a variety of cuisines. Typically, the words denoting a cuisine type (indian, chinese, mexican, etc.) are the same as those denoting nationalities, so we'll use spaCy's named entity recognizer to extract these cuisine tokens. As an example, let's extract tokens of type GPE and NORP.
- GPE: Geo-political entity
- NORP: Nationalities, religious, political groups
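If you want to see what an entity label means, spaCy can describe it via spacy.explain:
print(spacy.explain("GPE"))   # 'Countries, cities, states'
print(spacy.explain("NORP"))  # 'Nationalities or religious or political groups'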
text = """Definitely not the best Indian food I have had.
Good for them they are in Vegas where other Indian restaurants
are pretty bad as well."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.lower_, ent.start, ent.end)
doc[4:5], doc[17:18]
Build a cuisine X word co-occurrence matrix
Let's build a matrix whose rows are indexed by cuisine tokens and whose columns are indexed by nearby words (within a window of 10 tokens). Since the rows and columns of the matrix are indexed by numbers, we need to construct the following maps.
- cuisine2idx: map cuisine to index
- idx2cuisine: map index to cuisine
- token2idx: map other tokens to indices
- idx2token: map index to token
def get_cuisine_idx(cuisine):
    # assign a fresh index the first time a cuisine is seen, reuse it afterwards
    return cuisine2idx.setdefault(cuisine, len(cuisine2idx))

def get_token_idx(token):
    return token2idx.setdefault(token, len(token2idx))
def build_cuisine_token_mat(df, ws=10):
    row = []
    col = []
    data = []

    def do_count_tokens(cuisine, tokens):
        for tok in tokens:
            # skip tokens that are themselves entities, stop words, or punctuation
            if len(tok.ent_type_) > 0 or tok.is_stop or tok.is_punct:
                continue
            row.append(get_cuisine_idx(cuisine.lower_))
            col.append(get_token_idx(tok.lower_))
            data.append(1.)

    for doc in nlp.pipe(df.review):
        for ent in doc.ents:
            # NORP: nationalities, religious, political groups
            if ent.label_ == "NORP":
                # count left side window (clamp at 0 so early entities don't wrap around)
                do_count_tokens(ent, doc[max(ent.start - ws, 0):ent.start])
                # count right side window
                do_count_tokens(ent, doc[ent.end:ent.end + ws])

    # duplicate (row, col) entries are summed when building the CSR matrix
    return sparse.csr_matrix((data, (row, col)),
                             shape=(len(cuisine2idx), len(token2idx)))
#
%%time
np.random.seed(42)
cuisine2idx = {}
token2idx = {}
small_df = df.sample(20000)
cuisine_token_mat = build_cuisine_token_mat(small_df)
#
Let's view the top 20 cuisines
Each line below is an (index, cuisine) pair.
idx2token = {v:k for k,v in token2idx.items()}
idx2cuisine = {v:k for k,v in cuisine2idx.items()}
def get_top_cuisine_indices(mat, n):
    # sum each row, then return the indices of the n largest row sums
    rowsum = np.squeeze(np.asarray(mat.sum(axis=1)))
    return np.argsort(rowsum)[::-1][:n]

for idx in get_top_cuisine_indices(cuisine_token_mat, 20):
    print(idx, idx2cuisine[idx])
#
And the top tokens that occur near cuisines
def get_top_token_indices(mat, n):
    # sum each column, then return the indices of the n largest column sums
    colsum = np.squeeze(np.asarray(mat.sum(axis=0)))
    return np.argsort(colsum)[::-1][:n]

for idx in get_top_token_indices(cuisine_token_mat, 20):
    print(idx, idx2token[idx])
#
Co-occurrence matrices are typically sparse
print(f"Shape of cuisine x token: {cuisine_token_mat.shape}")
print(f"Number of non-zero entries: {cuisine_token_mat.nnz}")
print(f"Non-zero entries as a % of total size: {np.round(cuisine_token_mat.nnz/np.prod(cuisine_token_mat.shape) * 100,3)}%")
#
Visualize top cuisines x top tokens
def get_cuisine_token_df(mat, n=25):
    cuisine_indices = get_top_cuisine_indices(mat, n)
    token_indices = get_top_token_indices(mat, n)
    data = []
    for cidx in cuisine_indices:
        for tidx in token_indices:
            data.append((idx2cuisine[cidx], idx2token[tidx],
                         mat[cidx, tidx]))
    columns = ['cuisine', 'token', 'freq']
    df = pd.DataFrame(data, columns=columns)
    return df.pivot(index='cuisine', columns='token', values='freq')
ct_df = get_cuisine_token_df(cuisine_token_mat)
ct_df.drop(columns=[' ','food'], inplace=True)
fig = plt.figure(figsize=(10, 10))
sns.set(style='white')
ax = sns.heatmap(ct_df,
cmap="coolwarm",
annot=True,
center=0, square=True,
linewidths=0.2,cbar_kws={"shrink": 0.5})
#
Co-occurrence matrix reweighting using Pointwise Mutual Information
Positive Pointwise Mutual Information
- See Section 6.7 of Speech and Language Processing (Jurafsky & Martin)
Pointwise mutual information is one of the most important concepts in NLP. It is a measure of how often two events x and y occur together, compared with what we would expect if they were independent.
\begin{align} \text{PMI}(w,c) = \log_2\frac{P(w,c)}{P(w)P(c)} \end{align}
def pmi(xy, x, y, N):
    """
    Pointwise mutual information of two events x, y.
    xy is the joint count, x and y are the marginal counts,
    and N is the sample size of the universe.
    """
    # log2((xy/N) / ((x/N) * (y/N))) = log2(xy) + log2(N) - log2(x) - log2(y)
    return np.log2(xy) + np.log2(N) - np.log2(x) - np.log2(y)

def ppmi(xy, x, y, N, pmi_threshold):
    # positive PMI: clip values below the threshold to zero
    val = pmi(xy, x, y, N)
    return 0 if val < pmi_threshold else val
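As a quick sanity check of the formula (with made-up counts, not values from this dataset): if 'indian' and 'curry' co-occurred 20 times, 'indian' appeared 200 times, 'curry' appeared 50 times, and there were 100,000 co-occurrences in total, the observed-to-expected ratio is (20/100000) / ((200/100000) * (50/100000)) = 200, so the PMI is log2(200) ≈ 7.64.
print(pmi(20, 200, 50, 100_000))                       # ~7.64
print(ppmi(20, 200, 50, 100_000, pmi_threshold=0.05))  # same value, above the threshold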
def build_ppmi_mat(csr_mat, count_threshold, pmi_threshold):
    # drop rare co-occurrences: counts below count_threshold are zeroed out
    below_threshold = np.sum(csr_mat.data < count_threshold)
    total = csr_mat.nnz
    print(f"Fraction of entries below count_threshold: {below_threshold/total:.3f} (of {total} non-zero entries)")
    csr_mat = csr_mat.copy()
    csr_mat.data[csr_mat.data < count_threshold] = 0
    csr_mat.eliminate_zeros()
    print(f"Number of elements after zeroing out data: {csr_mat.nnz}")
    coo = csr_mat.tocoo()
    rowsum = np.squeeze(np.asarray(coo.sum(axis=1)))
    colsum = np.squeeze(np.asarray(coo.sum(axis=0)))
    N = coo.sum()
    # reweight each surviving count by its (positive) PMI
    data = [ppmi(count, rowsum[r], colsum[c], N, pmi_threshold)
            for r, c, count in zip(coo.row, coo.col, coo.data)]
    ppmi_csr = sparse.csr_matrix((data, (coo.row, coo.col)), shape=coo.shape)
    ppmi_csr.eliminate_zeros()
    return ppmi_csr
ppmi_mat = build_ppmi_mat(cuisine_token_mat, 3, 0.05)
#
Co-occurrence vs. PPMI
Let's compare the raw counts with the corresponding PPMI values for a sample of (cuisine, token) pairs.
cuisines_tokens = [('indian','food'),
('indian','restaurant'),
('indian','spicy'),
('indian','curry'),
('indian','basmati')
]
for c, t in cuisines_tokens:
    print(f"{c + ' ' + t:^48s}")
    cidx = cuisine2idx[c]
    tidx = token2idx[t]
    print(f"CO: {cuisine_token_mat[cidx, tidx]}, PPMI: {np.round(ppmi_mat[cidx, tidx], 2)}")
#
Let's also look at the most common co-occurring tokens for some well-known cuisines to see whether PPMI gives a better result than raw counts.
def print_top_tokens_for_cuisine(mat, cuisine, n=10):
    cidx = cuisine2idx[cuisine]
    vec = mat[cidx]
    # vec is a sparse row; argsort its stored values and map back to token ids
    indices = np.argsort(vec.data)[::-1][:n]
    for idx in indices:
        print(idx2token[vec.indices[idx]], end=', ')
cuisines = ['indian', 'french', 'mexican', 'japanese']
for c in cuisines:
    print(f"{c:^48s}")
    print(f"{'Co-occurrence':^48s}")
    print_top_tokens_for_cuisine(cuisine_token_mat, c)
    print()
    print(f"{'PPMI':^48s}")
    print_top_tokens_for_cuisine(ppmi_mat, c)
    print("\n")
#
Low Dimensional Representation of Cuisines
As mentioned above, the co-occurrence matrix is typically sparse. In our sample, since only cuisine tokens appear as rows, this may not be a serious issue. However, when dealing with a large number of tokens in the general case, we need to learn a low-dimensional vector representation for each token. We'll demonstrate this below by applying a technique called Singular Value Decomposition (SVD) to the PPMI matrix.
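To get a feel for what truncated SVD does, here is a toy sketch on a small random sparse matrix (hypothetical data, unrelated to the reviews): keeping only the k largest singular values factors the matrix into low-rank pieces whose rows live in a k-dimensional space.
# Toy example: factor a random 50 x 200 sparse matrix into rank-5 pieces.
toy = sparse.random(50, 200, density=0.05, format='csr', random_state=0)
U_t, S_t, VT_t = svds(toy, k=5)
print(U_t.shape, S_t.shape, VT_t.shape)   # (50, 5) (5,) (5, 200)
approx = U_t @ np.diag(S_t) @ VT_t        # best rank-5 approximation of toy
print(np.linalg.norm(toy.toarray() - approx))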
U, S, VT = svds(ppmi_mat, k=30)
# stack cuisine vectors (rows of U) on top of token vectors (rows of VT.T)
word_vectors = np.vstack((U, VT.T))
# normalize each vector to unit length so dot products become cosine similarities
norm = np.linalg.norm(word_vectors, axis=1, keepdims=True)
norm_word_vectors = np.divide(word_vectors, norm, out=np.zeros_like(word_vectors),
                              where=norm != 0)
def show_nearest_neighbours(cuisine, n=10):
    idx = cuisine2idx[cuisine]
    vec = norm_word_vectors[idx]
    # cosine similarity with every other vector; highest first
    for oidx in np.argsort(norm_word_vectors.dot(vec))[::-1][:n + 1]:
        if oidx == idx:
            continue
        # the first len(idx2cuisine) rows are cuisines, the rest are tokens
        if oidx < len(idx2cuisine):
            print(idx2cuisine[oidx], end=', ')
        else:
            print(idx2token[oidx - len(idx2cuisine)], end=', ')
#
Let's now compare the top co-occurring tokens for some popular cuisines (PPMI vs. PPMI + SVD).
cuisines = ['indian', 'french', 'mexican', 'japanese']
for c in cuisines:
    print(f"{c:^48s}")
    print(f"{'PPMI':^48s}")
    print_top_tokens_for_cuisine(ppmi_mat, c)
    print()
    print(f"{'PPMI + SVD':^48s}")
    show_nearest_neighbours(c)
    print("\n")
#
Recap
"You shall know a word by the company it keeps" (J.R. Firth, 1957)
Relevant excerpts from Chapter 6 of Speech and Language Processing by Daniel Jurafsky & James H. Martin:
- Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis . The hypothesis was first formulated in the 1950s by linguists like Joos (1950), Harris (1954), and Firth (1957), who noticed that words which are synonyms (like oculist and eye-doctor) tended to occur in the same environment (e.g., near words like eye or examined) with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments”.
- A vector semantics model instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts. These representations are used in every natural language processing application that makes use of meaning.
- An important focus of recent NLP research is representation learning , i.e. unsupervised methods to learn vector representations of the input, instead of creating representations manually by engineering features.