Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
Single word: happy
Bigram: not happy
Trigram: never not happy
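To make the three examples above concrete, here is a minimal sketch of how word n-grams can be pulled out of a sentence by sliding a window of n tokens across it. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; scikit-learn's vectorizers use their own tokenizer.

```python
def ngrams(text, n):
    """Return the word n-grams of `text`, splitting on whitespace (illustrative helper)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "never not happy"
unigrams = ngrams(sentence, 1)  # ['never', 'not', 'happy']
bigrams = ngrams(sentence, 2)   # ['never not', 'not happy']
trigrams = ngrams(sentence, 3)  # ['never not happy']
```

Note how the bigram "not happy" keeps the negation that the single word "happy" loses.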
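# Import the vectorizer and restrict it to bigrams only
from sklearn.feature_extraction.text import TfidfVectorizer
tv_bi_gram_vec = TfidfVectorizer(ngram_range=(2, 2))
# Fit and apply bigram vectorizer
tv_bi_gram = tv_bi_gram_vec.fit_transform(speech_df['text'])
# Print the bigram features (get_feature_names_out() requires scikit-learn >= 1.0)
print(list(tv_bi_gram_vec.get_feature_names_out()))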
['american people', 'best ability',
 'beloved country', 'best interests' ... ]
# Create a DataFrame with the Counts features
import pandas as pd
tv_df = pd.DataFrame(
    tv_bi_gram.toarray(),
    columns=tv_bi_gram_vec.get_feature_names_out()
).add_prefix('Counts_')
tv_sums = tv_df.sum()
print(tv_sums.head())
Counts_administration government 12
Counts_almighty god 15
Counts_american people 36
Counts_beloved country 8
Counts_best ability 8
dtype: int64
print(tv_sums.sort_values(ascending=False).head())
Counts_united states 152
Counts_fellow citizens 97
Counts_american people 36
Counts_federal government 35
Counts_self government 30
dtype: int64
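The column sums above total each bigram across every document, and sorting them surfaces the most common phrases. As a rough, library-free sketch of that idea, the snippet below tallies bigrams over a small made-up corpus with `collections.Counter`. The corpus `docs` and the helper `doc_bigrams` are hypothetical, and raw counts are assumed (as the integer output suggests) rather than tf-idf weights.

```python
from collections import Counter

def doc_bigrams(text):
    """Return the word bigrams of one document (illustrative helper)."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical toy corpus, not the speech data used in the slides
docs = [
    "united states government",
    "united states citizens",
    "american people united states",
    "american people",
]

# Sum bigram counts over all documents (the analogue of tv_df.sum())
totals = Counter()
for doc in docs:
    totals.update(doc_bigrams(doc))

# Most frequent bigrams first (the analogue of sort_values(ascending=False).head())
top_two = totals.most_common(2)
print(top_two)
```

On this toy corpus, "united states" appears three times and "american people" twice, so those two lead the ranking.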