Building a bag of words model

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

Recap of data format for ML algorithms

For any ML algorithm,

  • Data must be in tabular form
  • Training features must be numerical

Bag of words model

  • Extract word tokens
  • Compute frequency of word tokens
  • Construct a word vector out of these frequencies and the vocabulary of the corpus
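
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function name `bag_of_words`, the whitespace tokenization, and the two-sentence toy corpus are all assumptions made for the example.

```python
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, vectors) for a list of strings.

    Hypothetical helper: tokens come from splitting on whitespace,
    with no lowercasing or punctuation handling.
    """
    # Step 1: extract word tokens from every document.
    tokenized = [doc.split() for doc in corpus]

    # The vocabulary is every distinct token in the corpus, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Step 2: compute the frequency of each token per document.
    counts = [Counter(tokens) for tokens in tokenized]

    # Step 3: one vector per document, one position per vocabulary word.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Toy corpus, invented for illustration only.
vocabulary, vectors = bag_of_words(["red fish blue fish", "one fish"])
print(vocabulary)  # ['blue', 'fish', 'one', 'red']
print(vectors)     # [[1, 2, 0, 1], [0, 1, 1, 0]]
```

Every vector has the same length as the vocabulary, so the result is already in the tabular, numerical form an ML algorithm expects.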

Bag of words model example

Corpus

"The lion is the king of the jungle"
"Lions have lifespans of a decade"
"The lion is an endangered species"

Vocabulary

a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The

"The lion is the king of the jungle"
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
"Lions have lifespans of a decade"
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
"The lion is an endangered species"
[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
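
The vectors above can be checked with a few lines of Python. This sketch assumes whitespace tokenization and case-sensitive matching, which is why "the"/"The" and "lion"/"Lions" occupy separate positions. The sort key is an assumption chosen only to reproduce the column order shown on the slide.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Case-sensitive vocabulary. The sort key orders words alphabetically while
# ignoring case, and places a lowercase form before its capitalized twin.
tokens = {word for sentence in corpus for word in sentence.split()}
vocabulary = sorted(tokens, key=lambda w: (w.lower(), w != w.lower()))

# Count how often each vocabulary word occurs in each sentence.
vectors = [[sentence.split().count(word) for word in vocabulary]
           for sentence in corpus]

print(len(vocabulary))  # 15
for vector in vectors:
    print(vector)
```

Three short sentences already produce 15 dimensions, two of which exist only because of capitalization.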

Text preprocessing

  • Lions, lion → lion
  • The, the → the
  • No punctuation
  • No stopwords
  • Leads to a smaller vocabulary
  • Reducing the number of dimensions helps improve performance
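
To see how preprocessing shrinks the vocabulary, here is a rough sketch using only the standard library. The stopword set and the lemma lookup table are tiny, hand-written stand-ins assumed for this example; a real pipeline would typically rely on an NLP library such as spaCy or NLTK for both.

```python
import string

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Hypothetical, hand-picked stand-ins for a real stopword list and lemmatizer.
STOPWORDS = {"a", "an", "the", "is", "of", "have"}
LEMMAS = {"lions": "lion", "lifespans": "lifespan"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, and map words to lemmas."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [LEMMAS.get(tok, tok) for tok in text.split() if tok not in STOPWORDS]


raw_vocab = {tok for doc in corpus for tok in doc.split()}
clean_vocab = {tok for doc in corpus for tok in preprocess(doc)}

print(len(raw_vocab))       # 15
print(sorted(clean_vocab))  # 7 words remain
```

Under these assumptions the vocabulary drops from 15 words to 7, so each document vector has fewer than half as many dimensions.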

Bag of words model using sklearn

import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())
array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
       [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
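
The matrix has 13 columns rather than the 15 of the hand-built example because `CountVectorizer`, by default, lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`). As a result "The" merges with "the" and the single-letter word "a" is dropped. The following standard-library sketch mimics those two defaults to make the column order visible; it is an approximation for illustration, not scikit-learn's implementation.

```python
import re

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Lowercase each document, then extract its tokens.
tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted({tok for toks in tokenized for tok in toks})
matrix = [[toks.count(word) for word in vocabulary] for toks in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

With scikit-learn itself, the column labels can be read from `vectorizer.get_feature_names_out()` (available in scikit-learn 1.0 and later).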

Let's practice!
