Bag-of-words and n-grams

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Issues with bag-of-words

Positive meaning

Single word: happy

Negative meaning

Bigram: not happy

Positive meaning

Trigram: never not happy
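To make the point concrete, here is a minimal sketch of how the same phrase breaks into unigrams, bigrams, and trigrams. The helper `word_ngrams` is a hypothetical name used only for illustration; it splits on whitespace, which is simpler than the tokenization a real vectorizer applies.

```python
def word_ngrams(text, n):
    """Return the word n-grams in text as space-joined strings."""
    # Simplifying assumption: lowercase and split on whitespace only
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = 'never not happy'
unigrams = word_ngrams(phrase, 1)
bigrams = word_ngrams(phrase, 2)
trigrams = word_ngrams(phrase, 3)

print(unigrams)  # ['never', 'not', 'happy']
print(bigrams)   # ['never not', 'not happy']
print(trigrams)  # ['never not happy']
```

A unigram model sees only 'happy'. The bigram 'not happy' captures the negation, and the trigram captures the double negation.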

Using n-grams

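from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(2, 2) extracts bigrams only
tv_bi_gram_vec = TfidfVectorizer(ngram_range=(2, 2))

# Fit and apply bigram vectorizer
tv_bi_gram = tv_bi_gram_vec\
               .fit_transform(speech_df['text'])

# Print the bigram features
# (get_feature_names() was removed in scikit-learn 1.2)
print(tv_bi_gram_vec.get_feature_names_out().tolist())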

['american people', 'best ability',
 'beloved country', 'best interests' ... ]

Finding common words

import pandas as pd

# Create a DataFrame with the Counts features
tv_df = pd.DataFrame(tv_bi_gram.toarray(),
                     columns=tv_bi_gram_vec.get_feature_names_out())\
                        .add_prefix('Counts_')

tv_sums = tv_df.sum()
print(tv_sums.head())

Counts_administration government    12
Counts_almighty god                 15
Counts_american people              36
Counts_beloved country               8
Counts_best ability                  8
dtype: int64

Finding common words

# Sort first, then take the top rows inside print()
print(tv_sums.sort_values(ascending=False).head())

Counts_united states         152
Counts_fellow citizens        97
Counts_american people        36
Counts_federal government     35
Counts_self government        30
dtype: int64

Let's practice!
