Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
Free-text example:
Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the th day of the present month.
print(speech_df.head())
Name Inaugural Address \
0 George Washington First Inaugural Address
1 George Washington Second Inaugural Address
2 John Adams Inaugural Address
3 Thomas Jefferson First Inaugural Address
4 Thomas Jefferson Second Inaugural Address
Date text
0 Thursday, April 30, 1789 Fellow-Citizens of the Sena...
1 Monday, March 4, 1793 Fellow Citizens: I AM again...
2 Saturday, March 4, 1797 WHEN it was first perceived...
3 Wednesday, March 4, 1801 Friends and Fellow-Citizens...
4 Monday, March 4, 1805 PROCEEDING, fellow-citizens...
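[a-zA-Z]: All letter characters
[^a-zA-Z]: All non-letter characters
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
speech_df['text'] = speech_df['text']\
    .str.replace('[^a-zA-Z]', ' ', regex=True)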
Before:
"Fellow-Citizens of the Senate and of the House of
Representatives: AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
After:
"Fellow Citizens of the Senate and of the House of
Representatives AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
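The replacement step above can be tried on a small stand-alone Series. This is a minimal sketch: the two sample strings are made up for illustration, and the whitespace-collapsing step at the end is an optional extra that the slides do not perform.

```python
import pandas as pd

# Hypothetical sample strings; the real speech_df is not reproduced here.
sample = pd.Series([
    "Fellow-Citizens of the Senate: 1789!",
    "I AM again called...",
])

# Replace every character that is not a letter with a single space.
# Each removed character becomes one space, so string length is unchanged.
cleaned = sample.str.replace("[^a-zA-Z]", " ", regex=True)

# Optional (an assumption, not in the slides): collapse runs of whitespace
# left behind by the replacement and trim the ends.
collapsed = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(collapsed.tolist())
```

Because each removed character is swapped for a space rather than deleted, words that were joined by punctuation (such as "Fellow-Citizens") stay separate tokens.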
speech_df['text'] = speech_df['text'].str.lower()
print(speech_df['text'][0])
"fellow citizens of the senate and of the house of
representatives among the vicissitudes incident to
life no event could have filled me with greater"...
speech_df['char_cnt'] = speech_df['text'].str.len()
print(speech_df['char_cnt'].head())
0 1889
1 806
2 2408
3 1495
4 2465
Name: char_cnt, dtype: int64
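As a quick check of what `.str.len()` measures, the sketch below uses a hypothetical three-element Series. It shows that spaces count toward the length and that a missing value yields a missing count rather than an error.

```python
import pandas as pd

# Hypothetical sample, including one missing entry.
texts = pd.Series(["fellow citizens", "among the vicissitudes", None])

# Character count per row; spaces are included in the count.
char_cnt = texts.str.len()

print(char_cnt)
```

If the text column can contain missing values, the resulting column holds missing counts for those rows, so any later arithmetic on it should expect that.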
speech_df['word_splits'] = speech_df['text'].str.split()
speech_df['word_splits'].head(1)
['fellow', 'citizens', 'of', 'the', 'senate', 'and',...
speech_df['word_cnt'] = speech_df['text'].str.split().str.len()
print(speech_df['word_cnt'].head())
0 1432
1 135
2 2323
3 1736
4 2169
Name: word_cnt, dtype: int64
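One detail worth seeing in isolation is why `.str.split()` is called without an argument. After non-letters have been replaced by spaces, the text can contain runs of consecutive spaces; splitting on whitespace runs ignores them, while splitting on a single space produces empty tokens. The sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned text with doubled and tripled spaces.
texts = pd.Series([
    "fellow  citizens of the senate",
    "among the   vicissitudes",
])

# No argument: split on any run of whitespace, so no empty strings appear.
word_splits = texts.str.split()
word_cnt = word_splits.str.len()

# Splitting on a single space instead counts the empty strings between spaces.
naive_cnt = texts.str.split(" ").str.len()

print(word_cnt.tolist(), naive_cnt.tolist())
```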
speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']
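The three features can be combined end to end on a toy DataFrame. This is a sketch under stated assumptions: the rows are invented, and replacing a zero word count with NaN is a defensive step that is not in the slides. Note also that `char_cnt` includes spaces, so the ratio slightly overstates the true average word length.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one contains only spaces (no words).
df = pd.DataFrame({"text": [
    "fellow citizens of the senate",
    "i am again called",
    "   ",
]})

df["char_cnt"] = df["text"].str.len()
df["word_cnt"] = df["text"].str.split().str.len()

# Assumption: treat a zero word count as missing so the division
# yields NaN instead of inf for rows without any words.
df["avg_word_len"] = df["char_cnt"] / df["word_cnt"].replace(0, np.nan)

print(df[["char_cnt", "word_cnt", "avg_word_len"]])
```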