Embeddings

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

Limitations of BoW and TF-IDF

  • Treat similar words as completely unrelated
  • Fail to capture the meaning of text

 

Image showing how BoW/TF-IDF transform a dataset of multiple texts into a structured dataset where unique words are features.


Embeddings

Image showing that embedding transforms a word into a vector.

  • Represent a word with a vector that captures its meaning

Embeddings

Image showing the embedding vectors of four different words: car, movie, film, and king.

  • Assigns random values to each word

Embeddings

Image showing the sentence completion task where the goal is to complete a given sentence such as "I enjoyed watching the" with the right word.

  • Refines values by predicting missing words in sentences

Embeddings

Image showing that we can complete the sentence with two words having a similar meaning: movie and film.

  • Words appearing in similar contexts end up with similar representations
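The "similar contexts" idea can be sketched without any model at all: counting which words surround a target word already pushes "movie" and "film" together. This toy corpus and the two-word context window are illustrative assumptions, not how Word2Vec is actually trained.

```python
from collections import Counter

# Toy corpus: "movie" and "film" appear in the same contexts; "pizza" does not.
corpus = [
    "i enjoyed watching the movie tonight",
    "i enjoyed watching the film tonight",
    "we loved the movie we saw",
    "we loved the film we saw",
    "i enjoyed eating the pizza tonight",
]

def context_counts(target, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(c1) * norm(c2))

movie, film, pizza = (context_counts(w) for w in ("movie", "film", "pizza"))
print(cosine(movie, film))   # high: near-identical contexts
print(cosine(movie, pizza))  # lower: few shared context words
```

Real embedding models refine dense vectors so that words with overlapping contexts, like these, end up close together.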

Embeddings as GPS coordinates for words

Image showing GPS locations on a map.


Gensim

  • Provides popular embedding models
    • Word2Vec
    • GloVe
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
...

 

  Gensim logo.


Loading an embedding model

import gensim.downloader as api

model = api.load('glove-wiki-gigaword-50')
print(type(model))
print(model['movie'])
<class 'gensim.models.keyedvectors.KeyedVectors'>

[ 0.30824 0.17223 -0.23339 0.023105 0.28522 0.23076 -0.41048 -1.0035 -0.2072 1.4327 -0.80684 0.68954 -0.43648 1.1069 1.6107 -0.31966 0.47744 0.79395 -0.84374 0.064509 0.90251 0.78609 0.29699 0.76057 0.433 -1.5032 -1.6423 0.30256 0.30771 -0.87057 2.4782 -0.025852 0.5013 -0.38593 -0.15633 0.45522 0.04901 -0.42599 -0.86402 -1.3076 -0.29576 1.209 -0.3127 -0.72462 -0.80801 0.082667 0.26738 -0.98177 -0.32147 0.99823 ]

Computing similarity

similarity = model.similarity("film", "movie")

print(similarity)
0.9310100078582764
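The score gensim returns is the cosine similarity of the two word vectors. A minimal sketch with NumPy, using made-up 4-dimensional vectors in place of the real 50-dimensional GloVe ones:

```python
import numpy as np

# Toy stand-ins for the real GloVe vectors (values are assumptions).
film  = np.array([0.8, 0.1, 0.5, 0.3])
movie = np.array([0.7, 0.2, 0.6, 0.2])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(film, movie))  # ~0.98 for these toy vectors
```

A value near 1 means the vectors point in almost the same direction, i.e. the words have very similar meanings.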

Finding most similar words

similar_to_movie = model.most_similar('movie', topn=3)

print(similar_to_movie)
[('movies', 0.9322481155395508), 
 ('film', 0.9310100078582764), 
 ('films', 0.8937394618988037)]
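Under the hood, most_similar ranks every other word in the vocabulary by cosine similarity to the query vector. A minimal re-implementation over a toy vocabulary (the 3-dimensional vectors are illustrative, not real GloVe values):

```python
import numpy as np

# Toy vocabulary with made-up vectors (assumption, for illustration).
vectors = {
    "movie":  np.array([0.7, 0.2, 0.6]),
    "film":   np.array([0.8, 0.1, 0.5]),
    "movies": np.array([0.6, 0.3, 0.6]),
    "car":    np.array([-0.5, 0.9, 0.1]),
}

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    q = vectors[word]
    def cos(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    scores = [(w, cos(v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("movie"))  # "movies" and "film" rank above "car"
```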

Visualizing embeddings

  • Principal Component Analysis (PCA):
    • High-dimensional vectors → 2D or 3D vectors

A 3D map being flattened into a 2D one.

1 Image generated by DALL-E

Visualizing embeddings with PCA

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["film", "movie", "dog", "cat", "car", "bus"]
word_vectors = [model[word] for word in words]
pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(word_vectors)
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1])
for word, (x, y) in zip(words, word_vectors_2d):
    plt.annotate(word, (x, y))
plt.show()

Image showing that similar/related words are close to each other in the embedding space.


Comparison of embeddings

Image showing that similar/related words in the Word2Vec model are close to each other in the embedding space, but are differently placed than the GloVe model.

word2vec-google-news-300

Image showing that similar/related words in the GloVe model are close to each other in the embedding space.

glove-wiki-gigaword-50

Let's practice!
