Building a bag of words model

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Recap of data format for ML algorithms

For any ML algorithm,

Data must be in tabular form
Training features must be numerical

Bag of words model

Extract word tokens
Compute frequency of word tokens
Construct a word vector for each document from these frequencies and the vocabulary of the corpus

Bag of words model example

Corpus

"The lion is the king of the jungle"

"Lions have lifespans of a decade"

"The lion is an endangered species"

Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The

"The lion is the king of the jungle"

[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]

"Lions have lifespans of a decade"

[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]

"The lion is an endangered species"

[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
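The three steps (extract tokens, count them, build vectors over the vocabulary) can be sketched with the standard library alone. This is a minimal illustration, not a production tokenizer: it assumes a naive whitespace split and keeps case, so "the" and "The" stay separate words as in the vocabulary above. The sort key is an assumption chosen only to reproduce the slide's ordering (alphabetical, lowercase form before capitalized form).

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (naive whitespace split, case preserved)
tokenized = [doc.split() for doc in corpus]

# Step 2: compute the frequency of each token within each document
counts = [Counter(tokens) for tokens in tokenized]

# Vocabulary of the whole corpus: alphabetical, ignoring case,
# with the lowercase form placed before the capitalized form
vocabulary = sorted(
    {token for tokens in tokenized for token in tokens},
    key=lambda w: (w.lower(), w[0].isupper()),
)

# Step 3: one vector per document, one dimension per vocabulary word
# (a Counter returns 0 for words the document does not contain)
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each vector has 15 dimensions, one per vocabulary word, and most entries are zero because a single sentence uses only a few of the words in the corpus.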

Text preprocessing

Lions, lion → lion
The, the → the
Remove punctuation
Remove stopwords
Preprocessing leads to smaller vocabularies
A smaller vocabulary means fewer dimensions, which usually helps model performance
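To see how much preprocessing can shrink the vocabulary, here is a rough sketch on the same three-sentence corpus. Everything in it is a stand-in: `STOPWORDS` is a tiny hand-picked set rather than a library's stopword list, and `LEMMAS` is a toy lookup table playing the role of a real lemmatizer. A real pipeline would likely use a library such as spaCy or NLTK for both.

```python
import string

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Assumption: a tiny hand-picked stopword set, not a library's list
STOPWORDS = {"a", "an", "have", "is", "of", "the"}
# Assumption: a toy lookup table standing in for a real lemmatizer
LEMMAS = {"lions": "lion", "lifespans": "lifespan"}


def preprocess(doc):
    """Lowercase, strip punctuation, lemmatize, and drop stopwords."""
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [LEMMAS.get(tok, tok) for tok in cleaned.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]


# Vocabulary without preprocessing (whitespace split, case preserved)
raw_vocab = {tok for doc in corpus for tok in doc.split()}
# Vocabulary after preprocessing
clean_vocab = {tok for doc in corpus for tok in preprocess(doc)}

print(len(raw_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```

Under these assumptions the vocabulary drops from 15 words to 7, so each document vector is less than half as long.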

Bag of words model using sklearn

import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]
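The matrix above has 13 columns, not the 15 of the manual example, and the last entry of the first row is 3. That follows from two CountVectorizer defaults: it lowercases the text (`lowercase=True`), merging "The" with "the", and its default token pattern only keeps tokens of two or more characters, which drops "a". The sketch below imitates those two defaults with the standard library to make the column order visible. It is an approximation for illustration, not scikit-learn's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Assumption: mirrors CountVectorizer's documented default token pattern,
# which keeps only tokens with at least two word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    # Lowercase first, as CountVectorizer does by default
    return TOKEN_PATTERN.findall(doc.lower())


counts = [Counter(tokenize(doc)) for doc in corpus]

# Columns are the sorted vocabulary of the whole corpus
vocabulary = sorted(set().union(*counts))
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

With scikit-learn itself, recent versions expose the same column-to-word mapping through `vectorizer.get_feature_names_out()`.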

Let's practice!