Word Count Representation

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Text to columns

Initializing the vectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
print(cv)
CountVectorizer(analyzer='word', binary=False, 
        decode_error='strict', 
        dtype=<class 'numpy.int64'>, 
        encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, 
        min_df=1, ngram_range=(1, 1), preprocessor=None, 
        stop_words=None, strip_accents=None, 
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Specifying the vectorizer

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.1, max_df=0.9)

min_df: minimum fraction of documents the word must occur in
max_df: maximum fraction of documents the word can occur in
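
To make the two thresholds concrete, here is a minimal sketch on a made-up four-document corpus. The `docs` list and the 0.5 / 0.9 thresholds are illustrative assumptions, not the course data or the values used on the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the speech data used in the course).
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]

# Default settings: every token of two or more characters is kept.
cv_all = CountVectorizer()
cv_all.fit(docs)
vocab_all = sorted(cv_all.vocabulary_)

# Keep only words found in at least 50% and at most 90% of the documents.
cv_filtered = CountVectorizer(min_df=0.5, max_df=0.9)
cv_filtered.fit(docs)
vocab_filtered = sorted(cv_filtered.vocabulary_)

print(vocab_all)
print(vocab_filtered)
```

In this sketch `the` is dropped because it occurs in all four documents (above the 90% ceiling), and every word seen in only one document falls below the 50% floor, so only `sat` survives.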

Fitting the vectorizer

cv.fit(speech_df['text_clean'])
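
Fitting does not produce counts yet; it only learns which words become columns. A small sketch, assuming a hypothetical two-document corpus, shows the learned mapping stored in the `vocabulary_` attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus used only for illustration.
docs = ["good morning", "good night"]

cv = CountVectorizer()
cv.fit(docs)

# vocabulary_ maps each learned word to its column index
# (columns are ordered alphabetically).
vocab = cv.vocabulary_
print(vocab)
```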

Transforming your text

cv_transformed = cv.transform(speech_df['text_clean'])
print(repr(cv_transformed))
<58x8839 sparse matrix of type '<class 'numpy.int64'>'

Transforming your text

cv_transformed.toarray()
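
As a rough illustration of what `transform` and `toarray()` return, the sketch below uses a hypothetical two-document corpus: the result of `transform` is a SciPy sparse matrix with one row per document and one column per learned word, and `toarray()` expands it into an ordinary dense NumPy array.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus used only for illustration.
docs = ["red red blue", "blue green"]

cv = CountVectorizer()
cv.fit(docs)

# transform() returns a sparse matrix: rows are documents, columns are words.
sparse_counts = cv.transform(docs)

# toarray() converts it to a dense array (columns: blue, green, red).
dense_counts = sparse_counts.toarray()
print(sparse_counts.shape)
print(dense_counts)
```

The dense form is convenient for inspection, but with thousands of columns it can use far more memory than the sparse form.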

Getting the features

feature_names = cv.get_feature_names()  # scikit-learn 1.0+: cv.get_feature_names_out()
print(feature_names)
['abandon', 'abandoned', 'abandonment', 'abate', 
'abdicated', 'abeyance', 'abhorring', 'abide',
'abiding', 'abilities', 'ability', 'abject'...

Fit and transform

cv_transformed = cv.fit_transform(speech_df['text_clean'])
print(repr(cv_transformed))
<58x8839 sparse matrix of type '<class 'numpy.int64'>'
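
A quick check, on a hypothetical corpus, that `fit_transform` gives the same counts as calling `fit` and then `transform` on the same data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus used only for illustration.
docs = ["we the people", "the people speak", "we speak"]

# Two separate steps: learn the vocabulary, then count.
two_step = CountVectorizer().fit(docs).transform(docs)

# One combined step on the same data.
one_step = CountVectorizer().fit_transform(docs)

same = np.array_equal(two_step.toarray(), one_step.toarray())
print(same)
```

The separate `fit` and `transform` calls remain useful when the vocabulary should be learned from one dataset (for example a training set) and then applied unchanged to another.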

Putting it all together

# In scikit-learn 1.0+, use cv.get_feature_names_out() instead.
cv_df = pd.DataFrame(cv_transformed.toarray(), 
                     columns=cv.get_feature_names()).add_prefix('Counts_')
print(cv_df.head())
     Counts_aback    Counts_abandoned    Counts_a...
0               1                   0        ...
1               0                   0        ...
2               0                   1        ...
3               0                   1        ...
4               0                   0        ...
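
The same steps can be sketched end to end on a tiny stand-in DataFrame. The two-row `speech_df` here is hypothetical, and the `hasattr` check is an assumption added so the sketch runs on both older scikit-learn releases (`get_feature_names`) and newer ones (`get_feature_names_out`).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the course's speech_df.
speech_df = pd.DataFrame({"text_clean": ["we the people", "the people speak"]})

cv = CountVectorizer()
cv_transformed = cv.fit_transform(speech_df["text_clean"])

# Newer scikit-learn versions expose get_feature_names_out();
# fall back to get_feature_names() on older ones.
if hasattr(cv, "get_feature_names_out"):
    names = cv.get_feature_names_out()
else:
    names = cv.get_feature_names()

# One column per word, prefixed so the origin of each column stays clear.
cv_df = pd.DataFrame(cv_transformed.toarray(), columns=names).add_prefix("Counts_")
print(cv_df)
```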

Updating your DataFrame

speech_df = pd.concat([speech_df, cv_df], 
                      axis=1, sort=False)
print(speech_df.shape)
(58, 8845)
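
One thing worth checking before concatenating: `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below uses a hypothetical two-row `speech_df` with a non-default index and hand-written counts to show what can happen when the indexes differ, and one possible way to line them up.

```python
import pandas as pd

# Hypothetical original data with a non-default index.
speech_df = pd.DataFrame(
    {"Name": ["A", "B"], "text_clean": ["we the people", "the people speak"]},
    index=[10, 20],
)

# Hypothetical count columns built from an array, so the index is 0, 1.
cv_df = pd.DataFrame({"Counts_people": [1, 1], "Counts_we": [1, 0]})

# Index labels do not match, so rows are not paired up and NaNs appear.
misaligned = pd.concat([speech_df, cv_df], axis=1, sort=False)

# Reusing the original index pairs each text with its own counts.
cv_df.index = speech_df.index
aligned = pd.concat([speech_df, cv_df], axis=1, sort=False)

print(misaligned.shape)
print(aligned.shape)
```

If `speech_df` already has the default 0..n-1 index, as the output shape on the slide suggests, the plain `pd.concat` call works without this extra step.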

Let's practice!
