Bag of words and N-grams

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Issues with bag of words

Positive meaning

Single word: happy

Negative meaning

Bi-gram: not happy

Positive meaning

Tri-gram: never not happy
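
To make the progression above concrete, here is a minimal sketch of how word n-grams are formed from a phrase. The helper `word_ngrams` is a hypothetical name introduced for illustration only; it splits on whitespace, which is simpler than the tokenization scikit-learn applies.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "never not happy"
unigrams = word_ngrams(phrase, 1)
bigrams = word_ngrams(phrase, 2)
trigrams = word_ngrams(phrase, 3)

print(unigrams)  # ['never', 'not', 'happy']
print(bigrams)   # ['never not', 'not happy']
print(trigrams)  # ['never not happy']
```

A bag of single words sees only `happy` in all three cases, while the longer n-grams keep the negations attached to it.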

Using N-grams

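from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(2, 2) keeps bigrams only
tv_bi_gram_vec = TfidfVectorizer(ngram_range=(2, 2))

# Fit and apply bigram vectorizer
# (speech_df is assumed to be already loaded, with a 'text' column)
tv_bi_gram = tv_bi_gram_vec\
               .fit_transform(speech_df['text'])

# Print the bigram features
# (get_feature_names() was removed in scikit-learn 1.2)
print(tv_bi_gram_vec.get_feature_names_out().tolist())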

['american people', 'best ability',
 'beloved country', 'best interests' ... ]

Finding common words

import pandas as pd

# Create a DataFrame with the Counts features
tv_df = pd.DataFrame(tv_bi_gram.toarray(),
                     columns=tv_bi_gram_vec.get_feature_names_out())\
                        .add_prefix('Counts_')

tv_sums = tv_df.sum()
print(tv_sums.head())

Counts_administration government    12
Counts_almighty god                 15
Counts_american people              36
Counts_beloved country               8
Counts_best ability                  8
dtype: int64

Finding common words

print(tv_sums.sort_values(ascending=False).head())

Counts_united states         152
Counts_fellow citizens        97
Counts_american people        36
Counts_federal government     35
Counts_self government        30
dtype: int64

Let's practice!
