Bag-of-words and N-grams

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Issues with bag-of-words

Positive meaning

Single word: happy

Negative meaning

Bi-gram: not happy

Positive meaning

Tri-gram: never not happy
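
To make the three cases above concrete, here is a minimal sketch in plain Python of how one phrase breaks into single words, bi-grams, and tri-grams. The helper `ngrams` is a hypothetical name introduced for illustration, and splitting on whitespace is an assumption; library vectorizers tokenize differently.

```python
def ngrams(text, n):
    """Return the word n-grams of `text`, in order (hypothetical helper)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "never not happy"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)  # ['never', 'not', 'happy']
print(bigrams)   # ['never not', 'not happy']
print(trigrams)  # ['never not happy']
```

A single-word model sees only "happy" in all three phrases; the longer n-grams are what preserve the negation.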

Using N-grams

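from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only bi-grams: ngram_range=(min_n, max_n)
tv_bi_gram_vec = TfidfVectorizer(ngram_range=(2, 2))

# Fit and apply bigram vectorizer
tv_bi_gram = tv_bi_gram_vec\
               .fit_transform(speech_df['text'])

# Print the bigram features
# (get_feature_names() was removed in scikit-learn 1.2)
print(tv_bi_gram_vec.get_feature_names_out())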

['american people', 'best ability',
 'beloved country', 'best interests' ... ]

Finding common words

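import pandas as pd

# Create a DataFrame with the Counts features
tv_df = pd.DataFrame(tv_bi_gram.toarray(),
                     columns=tv_bi_gram_vec.get_feature_names_out())\
                        .add_prefix('Counts_')

# Sum each column to get one total per bi-gram.
# Note: integer totals come from raw counts (e.g. CountVectorizer);
# TF-IDF weights would sum to floats.
tv_sums = tv_df.sum()
print(tv_sums.head())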

Counts_administration government    12
Counts_almighty god                 15
Counts_american people              36
Counts_beloved country               8
Counts_best ability                  8
dtype: int64

Finding common words

# Sort totals from largest to smallest and show the top five
print(tv_sums.sort_values(ascending=False).head())

Counts_united states         152
Counts_fellow citizens        97
Counts_american people        36
Counts_federal government     35
Counts_self government        30
dtype: int64
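
As a cross-check on a ranking like the one above, the same idea can be sketched without scikit-learn: count every bi-gram across the documents and list the most frequent. This uses `collections.Counter` as a stand-in for the vectorizer in the slides, and the four-document corpus is made up (stop words are assumed to be removed already).

```python
from collections import Counter

# Made-up corpus for illustration; assumes stop words were already removed
docs = [
    "united states american people",
    "united states fellow citizens",
    "united states",
    "american people",
]

bigram_counts = Counter()
for doc in docs:
    tokens = doc.lower().split()  # assumption: whitespace tokenization
    # Pair each token with its successor to form bi-grams
    bigram_counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# The two most frequent bi-grams with their counts
top = bigram_counts.most_common(2)
print(top)  # [('united states', 3), ('american people', 2)]
```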

Let's practice!
