Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
Example of free text:
Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the th day of the present month.
print(speech_df.head())
                Name         Inaugural Address  \
0  George Washington   First Inaugural Address
1  George Washington  Second Inaugural Address
2         John Adams         Inaugural Address
3   Thomas Jefferson   First Inaugural Address
4   Thomas Jefferson  Second Inaugural Address

                       Date                            text
0  Thursday, April 30, 1789  Fellow-Citizens of the Sena...
1     Monday, March 4, 1793  Fellow Citizens: I AM again...
2   Saturday, March 4, 1797  WHEN it was first perceived...
3  Wednesday, March 4, 1801  Friends and Fellow-Citizens...
4     Monday, March 4, 1805  PROCEEDING, fellow-citizens...
[a-zA-Z] : All letter characters
[^a-zA-Z] : All non-letter characters
speech_df['text'] = speech_df['text']\
    .str.replace('[^a-zA-Z]', ' ', regex=True)
Before:
"Fellow-Citizens of the Senate and of the House of
Representatives: AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
After:
"Fellow Citizens of the Senate and of the House of
Representatives AMONG the vicissitudes incident to
life no event could have filled me with greater" ...
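The replacement step can be tried on a single string before running it on the full data set. The sketch below uses a one-row stand-in frame (`toy_df` is a hypothetical name, not part of the course data) and passes `regex=True` explicitly, on the assumption that the installed pandas version may treat the pattern as a literal string otherwise.

```python
import pandas as pd

# Toy stand-in for speech_df; the real data set is not reproduced here.
toy_df = pd.DataFrame(
    {'text': ["Fellow-Citizens of the Senate: AMONG the vicissitudes"]}
)

# Every character that is not a letter (including the space itself)
# is replaced by exactly one space.
toy_df['clean'] = toy_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

print(toy_df['clean'][0])
```

Because each non-letter is swapped one for one, the `": "` after "Senate" becomes two consecutive spaces. That is harmless for the word counts later on, since a whitespace split ignores repeated spaces.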
speech_df['text'] = speech_df['text'].str.lower()
print(speech_df['text'][0])
"fellow citizens of the senate and of the house of
representatives among the vicissitudes incident to
life no event could have filled me with greater"...
speech_df['char_cnt'] = speech_df['text'].str.len()
print(speech_df['char_cnt'].head())
0 1889
1 806
2 2408
3 1495
4 2465
Name: char_cnt, dtype: int64
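As a quick check of what `.str.len()` measures, the sketch below applies it to two short strings (the `toy` series is a hypothetical example, not taken from the speeches). The count covers every character in the string, spaces included.

```python
import pandas as pd

# Two short, already-cleaned strings used only for illustration.
toy = pd.Series(["fellow citizens", "i am again called"])

# .str.len() returns the number of characters in each string,
# counting spaces as well as letters.
char_cnt = toy.str.len()

print(char_cnt.tolist())
```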
speech_df['word_splits'] = \
    speech_df['text'].str.split()
speech_df['word_splits'].head(1)
['fellow', 'citizens', 'of', 'the', 'senate', 'and',...
speech_df['word_cnt'] = \
    speech_df['text'].str.split().str.len()
print(speech_df['word_cnt'].head())
0 1432
1 135
2 2323
3 1736
4 2169
Name: word_cnt, dtype: int64
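One detail worth checking is how the split behaves when the cleaned text contains repeated spaces. The sketch below, on a hypothetical one-element series, compares `.str.split()` with no argument against splitting on a single space character.

```python
import pandas as pd

# A cleaned string with a double space, as left behind when
# punctuation is replaced by spaces.
toy = pd.Series(["senate  among the"])

# No argument: split on runs of whitespace, so no empty tokens appear.
by_whitespace = toy.str.split()

# Explicit single space: the double space yields an empty string token.
by_single_space = toy.str.split(' ')

print(by_whitespace[0], by_whitespace.str.len()[0])
print(by_single_space[0], by_single_space.str.len()[0])
```

Splitting with no argument is the safer choice for word counts here, because empty tokens would otherwise inflate the totals.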
speech_df['avg_word_len'] = \
    speech_df['char_cnt'] / speech_df['word_cnt']
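The individual steps can be gathered into one function. This is a minimal sketch rather than the course's own code: `add_text_features` and `toy_df` are hypothetical names, and the function assumes a string column called `text`.

```python
import pandas as pd

def add_text_features(df, col='text'):
    """Hypothetical helper: clean a text column and add simple count features."""
    out = df.copy()
    # Replace non-letters with spaces, then lowercase.
    out[col] = out[col].str.replace('[^a-zA-Z]', ' ', regex=True).str.lower()
    # Characters per row (spaces included) and words per row.
    out['char_cnt'] = out[col].str.len()
    out['word_cnt'] = out[col].str.split().str.len()
    # Ratio of the two counts, as on the slide.
    out['avg_word_len'] = out['char_cnt'] / out['word_cnt']
    return out

toy_df = pd.DataFrame({'text': ["Fellow-Citizens of the Senate"]})
features = add_text_features(toy_df)
print(features[['char_cnt', 'word_cnt', 'avg_word_len']])
```

Since `char_cnt` includes spaces, the ratio is slightly larger than the mean number of letters per word. For a feature used to compare documents with each other, that constant bias usually matters little.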