Bag-of-Words representation

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

NLP workflow recap

The full workflow diagram mentioning that Chapters 3 and 4 focus on the transformers libraries.

Bag-of-Words (BoW)

  • Foundational technique to represent text as numbers
  • Represent text by counting how often each word appears
  • Throws words in a bag and counts them
  • Ignores grammar and order

Image showing words of a text being thrown in a bag and then individual word occurrences are being counted.
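A minimal sketch of the "bag" idea, using only the standard library. The two sentences and the helper name bag_of_words are made up for illustration; splitting on whitespace is a simplifying assumption, not a full tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True: word order is lost, only the counts remain
```

The two sentences mean different things, yet their bags are identical, which is what "ignores grammar and order" amounts to in practice.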

BoW example

Image showing two sentences: 'I love NLP' and 'I love machine learning.'

BoW example

Image showing the vocabulary derived from the sentences: I, love, NLP, machine, and learning.

  • Build a vocabulary of all unique words

BoW example

Image displaying feature vectors for each sentence, generated by counting word occurrences according to the defined vocabulary.

  • Count how many times each word from the vocabulary appears
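The two steps can be sketched by hand for the example sentences. This is a plain-Python illustration that assumes whitespace splitting and keeps the vocabulary in order of first appearance, matching the order shown in the vocabulary image; it is not how a library would implement it.

```python
sentences = ["I love NLP", "I love machine learning"]

# Step 1: build the vocabulary of unique words, in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how many times each vocabulary word appears in each sentence
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)  # ['I', 'love', 'NLP', 'machine', 'learning']
print(vectors)     # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

Each sentence becomes a vector with one slot per vocabulary word, so both vectors have the same length regardless of sentence length.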

BoW with code

reviews = ["I loved the movie. It was amazing!",
           "The movie was okay.",
           "I hated the movie. It was boring."]

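import string
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data downloaded

def preprocess(text):
    # Lowercase, tokenize, and drop punctuation tokens
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in string.punctuation]
    return " ".join(tokens)

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)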
['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']

BoW with code

from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()
vectorizer.fit(cleaned_reviews)
print(vectorizer.get_feature_names_out())
['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']

  • 'i' is missing: by default, CountVectorizer only keeps tokens of two or more characters
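A rough way to see where this vocabulary comes from, without scikit-learn: the sketch below applies the regular expression documented as CountVectorizer's default token_pattern and sorts the unique matches. It is an approximation of the fitting step for illustration, not the library's actual implementation.

```python
import re

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

cleaned_reviews = ["i loved the movie it was amazing",
                   "the movie was okay",
                   "i hated the movie it was boring"]

# Collect every matching token, deduplicate, and sort alphabetically
vocabulary = sorted({token
                     for review in cleaned_reviews
                     for token in re.findall(TOKEN_PATTERN, review.lower())})

print(vocabulary)
# ['amazing', 'boring', 'hated', 'it', 'loved', 'movie', 'okay', 'the', 'was']
print("i" in vocabulary)  # False: single-character tokens do not match
```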

BoW output

X = vectorizer.transform(cleaned_reviews)

# OR X = vectorizer.fit_transform(cleaned_reviews)
print(X)
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 16 stored elements and shape (3, 9)>

Sparse matrix: table mostly filled with zeros, so only the non-zero counts are stored
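To make "stored elements" concrete, here is a simplified sketch that keeps only the non-zero counts as (row, column, value) triples. The dense table is the one X.toarray() prints for these three reviews. The triple layout is an assumption chosen for readability; SciPy's Compressed Sparse Row format organizes the same information differently.

```python
# Counts for the three reviews, as printed by X.toarray()
dense = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Keep only the non-zero entries as (row, column, value) triples
stored = [(r, c, value)
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0]
shape = (len(dense), len(dense[0]))

print(len(stored), shape)  # 16 (3, 9)
```

With only three short reviews the saving is small, but with thousands of documents and a large vocabulary most cells are zero and skipping them matters.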

BoW output

print(X.toarray())
[[1 0 0 1 1 1 0 1 1]
 [0 0 0 0 0 1 1 1 1]
 [0 1 1 1 0 1 0 1 1]]
print(vectorizer.get_feature_names_out())
['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
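Each column of the array lines up with the word at the same position in the feature names. A small sketch of reading the matrix that way, with both outputs above copied in as plain Python lists (review_counts is a name introduced here for illustration):

```python
words = ['amazing', 'boring', 'hated', 'it', 'loved',
         'movie', 'okay', 'the', 'was']
rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Pair every column with its word and keep the words that actually occur
review_counts = [{word: count for word, count in zip(words, row) if count > 0}
                 for row in rows]

print(review_counts[1])  # {'movie': 1, 'okay': 1, 'the': 1, 'was': 1}
```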

Word frequencies

import numpy as np

word_counts = np.sum(X.toarray(), axis=0)
words = vectorizer.get_feature_names_out()
import matplotlib.pyplot as plt

plt.bar(words, word_counts)
plt.title("Word Frequencies in Movie Reviews")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

Bar plot showing words and their frequencies, with stop words being the most frequent.
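The same totals can be checked without plotting. This sketch sums each column of the count table by hand and ranks the words; it uses plain lists copied from the outputs above rather than NumPy, purely as an illustration.

```python
words = ['amazing', 'boring', 'hated', 'it', 'loved',
         'movie', 'okay', 'the', 'was']
rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Column sums: total count of each word across all reviews
totals = [sum(column) for column in zip(*rows)]

# Rank words from most to least frequent (ties keep vocabulary order)
ranked = sorted(zip(words, totals), key=lambda pair: pair[1], reverse=True)

print(ranked[:3])  # [('movie', 3), ('the', 3), ('was', 3)]
```

Words that appear in every review dominate the ranking, which is why common words stand out in the bar plot.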

Let's practice!
