Introduction to Text Encoding

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Standardizing your text

Example of free text:

Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month.

Dataset

print(speech_df.head())
                  Name           Inaugural Address    \ 
0    George Washington     First Inaugural Address
1    George Washington    Second Inaugural Address
2    John Adams                  Inaugural Address    
3    Thomas Jefferson      First Inaugural Address    
4    Thomas Jefferson     Second Inaugural Address

                        Date                               text
0    Thursday, April 30, 1789    Fellow-Citizens of the Sena...
1       Monday, March 4, 1793    Fellow Citizens: I AM again...
2     Saturday, March 4, 1797    WHEN it was first perceived...
3    Wednesday, March 4, 1801    Friends and Fellow-Citizens...
4       Monday, March 4, 1805    PROCEEDING, fellow-citizens...

Removing unwanted characters

  • [a-zA-Z]: All letter characters
  • [^a-zA-Z]: All non-letter characters
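# Replace every non-letter character with a space
# (regex=True is needed in pandas 2.0+, where str.replace is literal by default)
speech_df['text'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)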
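To make the effect of the character class concrete, here is a minimal sketch on a toy Series. The sample strings and the names `toy` and `cleaned` are made up for illustration and are not part of the speech dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not the speech dataset).
toy = pd.Series(["Fellow-Citizens: AMONG the 1st!", "I AM again..."])

# regex=True tells pandas to treat the pattern as a regular expression.
# Each single non-letter character becomes one space, so string length is unchanged.
cleaned = toy.str.replace("[^a-zA-Z]", " ", regex=True)

print(cleaned.tolist())
```

Because every non-letter is swapped for exactly one space, adjacent non-letters (such as ": ") leave runs of spaces behind. Digits are removed as well, which may or may not be desirable depending on the task.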

Removing unwanted characters

Before:

"Fellow-Citizens of the Senate and of the House of  
Representatives: AMONG the vicissitudes incident to   
life no event could have filled me with greater" ...

After:

"Fellow Citizens of the Senate and of the House of  
Representatives AMONG the vicissitudes incident to   
life no event could have filled me with greater" ...

Standardize the case

speech_df['text'] = speech_df['text'].str.lower()
print(speech_df['text'][0])
"fellow citizens of the senate and of the house of  
representatives among the vicissitudes incident to   
life no event could have filled me with greater"...

Length of text

speech_df['char_cnt'] = speech_df['text'].str.len()
print(speech_df['char_cnt'].head())
0    1889  
1     806  
2    2408  
3    1495  
4    2465
Name: char_cnt, dtype: int64

Word counts

# Split each speech into a list of words (splits on whitespace by default)
speech_df['word_splits'] = speech_df['text'].str.split()
print(speech_df['word_splits'].head(1))
['fellow', 'citizens', 'of', 'the', 'senate', 'and',...

Word counts

# Count the words in each speech: split on whitespace, then take each list's length
speech_df['word_cnt'] = speech_df['text'].str.split().str.len()
print(speech_df['word_cnt'].head())
0    1432
1     135
2    2323
3    1736
4    2169
Name: word_cnt, dtype: int64
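One detail worth seeing in isolation is how the split handles the extra spaces that character removal can leave behind. The sketch below uses an invented one-row Series; `toy`, `tokens`, and `naive_cnt` are illustrative names only.

```python
import pandas as pd

# Toy string with a double space, as can be left behind after cleaning.
toy = pd.Series(["fellow citizens  of the senate"])

# With no arguments, str.split() splits on runs of whitespace,
# so the double space does not produce an empty token.
tokens = toy.str.split()
word_cnt = tokens.str.len()

# Splitting on a single literal space keeps an empty string between
# the two adjacent spaces, which inflates the count.
naive_cnt = toy.str.split(" ").str.len()

print(tokens[0])
print(word_cnt[0], naive_cnt[0])
```

This suggests that the argument-free form is the safer default for counting words after non-letter characters have been replaced with spaces.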

Average length of word

# Approximate average word length (char_cnt also counts the spaces between words)
speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']
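The steps covered so far can be chained together. What follows is a rough end-to-end sketch on a two-row toy DataFrame; `df` and its contents are assumptions standing in for `speech_df`, while the column names follow the slides.

```python
import pandas as pd

# Toy DataFrame standing in for speech_df (contents are made up).
df = pd.DataFrame({"text": ["Fellow-Citizens of the Senate", "I AM again called"]})

# 1. Replace non-letters with spaces, then lowercase.
df["text"] = df["text"].str.replace("[^a-zA-Z]", " ", regex=True).str.lower()

# 2. Character count (includes spaces) and word count.
df["char_cnt"] = df["text"].str.len()
df["word_cnt"] = df["text"].str.split().str.len()

# 3. Ratio of the two as a rough average word length.
df["avg_word_len"] = df["char_cnt"] / df["word_cnt"]

print(df)
```

Since `char_cnt` counts spaces too, this ratio slightly overstates the number of letters per word. If a stricter measure were needed, one option would be to count only letters before dividing.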

Let's practice!
