Bag-of-Words representation

Natural Language Processing (NLP) in Python

Fouad Trad

Machine Learning Engineer

NLP workflow recap

The full workflow diagram mentioning that Chapters 3 and 4 focus on the transformers libraries.

Bag-of-Words (BoW)

Foundational technique to represent text as numbers
Represent text by counting how often each word appears
Throws the words into a bag and counts them
Ignores grammar and order
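
To make the "ignores grammar and order" point concrete, here is a minimal sketch that uses only the standard library. The two sentences are made up for illustration and are not from the course material, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Two illustrative sentences (hypothetical, not from the slides) that use
# the same words in a different order.
sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

# A bag of words is a word -> count mapping.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_a == bag_b)  # True: word order is lost
```

The two sentences mean different things, yet their bags are identical, which is exactly the information BoW discards.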

Image showing the words of a text being thrown into a bag, after which individual word occurrences are counted.

BoW example

Image showing two sentences: 'I love NLP' and 'I love machine learning.'

BoW example

Image showing the vocabulary derived from the sentences: I, love, NLP, machine, and learning.

Build a vocabulary of all unique words

BoW example

Image displaying feature vectors for each sentence, generated by counting word occurrences according to the defined vocabulary.

Build a vocabulary of all unique words
Count how many times each word from the vocabulary appears
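
The two steps above can be sketched by hand in plain Python before reaching for a library. This is an illustrative sketch, not the course's implementation: it splits on whitespace, keeps the original casing, and orders the vocabulary by first appearance so that it matches the example sentences.

```python
sentences = ["I love NLP", "I love machine learning"]

# Step 1: build the vocabulary of unique words, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how often each vocabulary word appears in each sentence.
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)  # ['I', 'love', 'NLP', 'machine', 'learning']
print(vectors)     # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```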

BoW with code

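import string
from nltk.tokenize import word_tokenize  # requires NLTK's Punkt tokenizer data to be downloaded

reviews = ["I loved the movie. It was amazing!",
           "The movie was okay.",
           "I hated the movie. It was boring."]

def preprocess(text):
    # Lowercase so "The" and "the" count as the same word
    text = text.lower()
    # Split the text into word and punctuation tokens
    tokens = word_tokenize(text)
    # Drop tokens that are punctuation marks
    tokens = [word for word in tokens if word not in string.punctuation]
    return " ".join(tokens)

cleaned_reviews = [preprocess(review) for review in reviews]
print(cleaned_reviews)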

['i loved the movie it was amazing', 
 'the movie was okay', 
 'i hated the movie it was boring']

BoW with code

from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()

vectorizer.fit(cleaned_reviews)

print(vectorizer.get_feature_names_out())

['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
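
The vocabulary has no entry for "i", even though it appears in two of the cleaned reviews. That follows from CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters. The sketch below applies that same pattern with the standard re module to show the effect; it imitates the tokenization step only and is not a substitute for the vectorizer.

```python
import re

# Default token pattern used by scikit-learn's CountVectorizer:
# a token must contain at least two word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

cleaned_review = "i loved the movie it was amazing"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, cleaned_review)

print(tokens)  # ['loved', 'the', 'movie', 'it', 'was', 'amazing']
```

If one-character tokens matter for a task, passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer keeps them.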

BoW output

X = vectorizer.transform(cleaned_reviews)

# OR fit and transform in a single step
X = vectorizer.fit_transform(cleaned_reviews)

print(X)

<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 16 stored elements and shape (3, 9)>

Sparse matrix: stores only the nonzero counts, since a BoW table is usually mostly filled with zeros
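
To see what "16 stored elements" refers to, the sketch below keeps only the nonzero cells of the count table. The dense table is the count matrix for the three reviews written out by hand, and the dictionary is a simplified stand-in for the idea of sparse storage, not the layout SciPy's CSR format uses internally.

```python
# Count matrix for the three cleaned reviews (rows) over the 9-word
# vocabulary (columns), written out by hand for illustration.
dense = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Keep only the nonzero cells as {(row, column): count}.
stored = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print(len(stored))                # 16 stored elements
print(len(stored) / total_cells)  # fraction of cells that are nonzero
```

In this toy example 16 of the 27 cells are nonzero, so little is saved; the saving grows as the vocabulary gets larger and each document uses only a small part of it.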

BoW output

print(X.toarray())

[[1 0 0 1 1 1 0 1 1]
 [0 0 0 0 0 1 1 1 1]
 [0 1 1 1 0 1 0 1 1]]

print(vectorizer.get_feature_names_out())

['amazing' 'boring' 'hated' 'it' 'loved' 'movie' 'okay' 'the' 'was']
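
Each column of the array lines up with the word at the same position in the feature names. A small sketch of reading one row back as words, with both lists copied by hand from the output above so that it runs without scikit-learn:

```python
words = ["amazing", "boring", "hated", "it", "loved",
         "movie", "okay", "the", "was"]
rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Pair each count with its word and keep the words that occur in the review.
labelled = [{word: count for word, count in zip(words, row) if count > 0}
            for row in rows]

print(labelled[1])  # {'movie': 1, 'okay': 1, 'the': 1, 'was': 1}
```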

Word frequencies

import numpy as np
import matplotlib.pyplot as plt

# Sum each column to get the total count of every word across all reviews
word_counts = np.sum(X.toarray(), axis=0)
words = vectorizer.get_feature_names_out()

plt.bar(words, word_counts)
plt.title("Word Frequencies in Movie Reviews")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

Bar plot showing words and their frequencies, with stop words being the most frequent.
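
The numbers behind the plot can be checked without NumPy. In this sketch, which copies the vocabulary and count matrix by hand from the earlier output, zip(*rows) walks the table column by column, so summing each column mirrors np.sum(X.toarray(), axis=0).

```python
words = ["amazing", "boring", "hated", "it", "loved",
         "movie", "okay", "the", "was"]
rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 0, 1, 1],
]

# Column sums: total occurrences of each word across all reviews.
word_counts = [sum(column) for column in zip(*rows)]

# Sort from most to least frequent; ties keep vocabulary order.
ranked = sorted(zip(words, word_counts), key=lambda pair: pair[1], reverse=True)

print(ranked[:4])  # [('movie', 3), ('the', 3), ('was', 3), ('it', 2)]
```

Apart from "movie", the top entries are stop words. If they are not wanted, CountVectorizer accepts stop_words="english" to drop scikit-learn's built-in English stop word list.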

Let's practice!
