Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
Free text example:
Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the ... day of the present month.
print(speech_df.head())
Name Inaugural Address \
0 George Washington First Inaugural Address
1 George Washington Second Inaugural Address
2 John Adams Inaugural Address
3 Thomas Jefferson First Inaugural Address
4 Thomas Jefferson Second Inaugural Address
Date text
0 Thursday, April 30, 1789 Fellow-Citizens of the Sena...
1 Monday, March 4, 1793 Fellow Citizens: I AM again...
2 Saturday, March 4, 1797 WHEN it was first perceived...
3 Wednesday, March 4, 1801 Friends and Fellow-Citizens...
4 Monday, March 4, 1805 PROCEEDING, fellow-citizens...
[a-zA-Z]: All letter characters
[^a-zA-Z]: All non-letter characters
# regex=True is needed in pandas 2.0+, where str.replace defaults to a literal match
speech_df['text'] = speech_df['text']\
    .str.replace('[^a-zA-Z]', ' ', regex=True)
Before:
"Fellow-Citizens of the Senate and of the House of
Representatives: AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
After:
"Fellow Citizens of the Senate and of the House of
Representatives AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
speech_df['text'] = speech_df['text'].str.lower()
print(speech_df['text'][0])
"fellow citizens of the senate and of the house of
representatives among the vicissitudes incident to
life no event could have filled me with greater"...
speech_df['char_cnt'] = speech_df['text'].str.len()
print(speech_df['char_cnt'].head())
0 1889
1 806
2 2408
3 1495
4 2465
Name: char_cnt, dtype: int64
speech_df['word_cnt'] = speech_df['text'].str.split()
speech_df['word_cnt'].head(1)
['fellow', 'citizens', 'of', 'the', 'senate', 'and',...
speech_df['word_cnt'] = speech_df['text'].str.split().str.len()
print(speech_df['word_cnt'].head())
0 1432
1 135
2 2323
3 1736
4 2169
Name: word_cnt, dtype: int64
speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']
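The steps above can be chained into one short, runnable sketch. The two-row `toy_df` below is a hypothetical stand-in for `speech_df`, built from the truncated openings shown in the earlier output; it is an assumption for illustration, not the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for speech_df (assumed data, not the real speeches).
toy_df = pd.DataFrame({
    "text": [
        "Fellow-Citizens: I AM again called",
        "WHEN it was first perceived",
    ]
})

# 1. Replace non-letters with spaces, then lowercase.
toy_df["text"] = (
    toy_df["text"]
    .str.replace("[^a-zA-Z]", " ", regex=True)
    .str.lower()
)

# 2. Character count (includes the spaces between words).
toy_df["char_cnt"] = toy_df["text"].str.len()

# 3. Word count: split on whitespace, then take the length of each list.
toy_df["word_cnt"] = toy_df["text"].str.split().str.len()

# 4. Average word length as characters per word.
toy_df["avg_word_len"] = toy_df["char_cnt"] / toy_df["word_cnt"]

print(toy_df[["char_cnt", "word_cnt", "avg_word_len"]])
```

Note that `char_cnt` counts whitespace as well as letters, so `avg_word_len` computed this way runs slightly higher than the mean length of the words themselves.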