Bag of words and N-grams

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Issues with bag of words

Positive meaning

Single word: happy

Negative meaning

Bigram: not happy

Positive meaning

Trigram: never not happy
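The three cases above can be reproduced with a few lines of plain Python. This is only a sketch: `word_ngrams` is a hypothetical helper written for this illustration, not part of any library, and it splits on whitespace rather than using a real tokenizer.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text (hypothetical helper)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "never not happy"
unigrams = word_ngrams(phrase, 1)   # single words lose the negation context
bigrams = word_ngrams(phrase, 2)    # pairs capture "not happy"
trigrams = word_ngrams(phrase, 3)   # triples capture the double negation

print(unigrams)
print(bigrams)
print(trigrams)
```

A bag-of-words model sees only the unigrams, so "happy" looks the same whether or not it is negated; the longer n-grams keep that context.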

Using N-grams

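from sklearn.feature_extraction.text import TfidfVectorizer

# Build a vectorizer that extracts bigrams only
tv_bi_gram_vec = TfidfVectorizer(ngram_range=(2, 2))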

# Fit and apply bigram vectorizer
tv_bi_gram = tv_bi_gram_vec\
               .fit_transform(speech_df['text'])

# Print the bigram features
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tv_bi_gram_vec.get_feature_names_out())

[u'american people', u'best ability',
 u'beloved country', u'best interests' ... ]

Finding common words

import pandas as pd

# Create a DataFrame with the Counts features
tv_df = pd.DataFrame(tv_bi_gram.toarray(),
                     columns=tv_bi_gram_vec.get_feature_names_out())\
                        .add_prefix('Counts_')

tv_sums = tv_df.sum()
print(tv_sums.head())

Counts_administration government    12
Counts_almighty god                 15
Counts_american people              36
Counts_beloved country               8
Counts_best ability                  8
dtype: int64

Finding common words

print(tv_sums.sort_values(ascending=False).head())

Counts_united states         152
Counts_fellow citizens        97
Counts_american people        36
Counts_federal government     35
Counts_self government        30
dtype: int64
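The sum-and-sort step can also be mimicked without scikit-learn or pandas, which may help show what the numbers mean. This is a rough sketch: the two sentences are invented for the example, `word_bigrams` is a hypothetical helper, and no stop-word removal is applied, so the counts are not comparable to the output above.

```python
from collections import Counter

# Invented toy sentences, standing in for speech_df['text']
speeches = [
    "my fellow citizens of the united states",
    "the people of the united states",
]

def word_bigrams(text):
    """Return consecutive word pairs in text (hypothetical helper)."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Tally every bigram across all documents
bigram_counts = Counter()
for speech in speeches:
    bigram_counts.update(word_bigrams(speech))

# The most frequent bigrams, analogous to sort_values(ascending=False).head()
top_bigrams = bigram_counts.most_common(3)
print(top_bigrams)
```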

Let's practice!
