Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
Chapter 1: Feature creation and extraction
Chapter 2: Engineering messy data
Chapter 3: Feature normalization
Chapter 4: Working with text features
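import pandas as pd
# path_to_csv_file is a placeholder for the location of the survey CSV
df = pd.read_csv(path_to_csv_file)
# Preview the first five rows
print(df.head())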
            SurveyDate  \
0  2018-02-28 20:20:00
1  2018-06-28 13:26:00
2  2018-06-06 03:37:00
3  2018-05-09 01:06:00
4  2018-04-12 22:41:00
                            FormalEducation
0  Bachelor's degree (BA. BS. B.Eng.. etc.)
1  Bachelor's degree (BA. BS. B.Eng.. etc.)
2  Bachelor's degree (BA. BS. B.Eng.. etc.)
3         Some college/university study ...
4  Bachelor's degree (BA. BS. B.Eng.. etc.)
print(df.columns)
Index(['SurveyDate', 'FormalEducation',
'ConvertedSalary', 'Hobby', 'Country',
'StackOverflowJobsRecommend', 'VersionControl',
'Age', 'Years Experience', 'Gender',
'RawSalary'], dtype='object')
print(df.dtypes)
SurveyDate           object
FormalEducation      object
ConvertedSalary     float64
...
Years Experience      int64
Gender               object
RawSalary            object
dtype: object
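The dtypes listing shows SurveyDate stored as a generic, non-datetime column even though it holds timestamps. A minimal sketch of one way to get a datetime column, assuming the timestamps follow the format shown in the preview. The two SurveyDate values are copied from the head() output; the Age values and the in-memory CSV are invented for illustration.

```python
import io

import pandas as pd

# Two SurveyDate values from the preview; the Age values are hypothetical.
csv_text = (
    "SurveyDate,Age\n"
    "2018-02-28 20:20:00,21\n"
    "2018-06-28 13:26:00,38\n"
)

# Read without any hints: SurveyDate comes back as a string column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Option 1: ask read_csv to parse the column while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["SurveyDate"])
print(parsed.dtypes)

# Option 2: convert an already loaded column.
converted = pd.to_datetime(raw["SurveyDate"])
print(converted.dt.month.tolist())
```

Either option gives a datetime column whose parts (year, month, hour) are available through the .dt accessor.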
only_ints = df.select_dtypes(include=['int'])
print(only_ints.columns)
Index(['Age', 'Years Experience'], dtype='object')
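The string 'int' is resolved through NumPy, and on some platform and NumPy version combinations it may mean a 32-bit integer, in which case int64 columns would be skipped. A hedged sketch of a more portable selection, using a small hypothetical DataFrame that borrows column names from the survey data (all values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": np.array([21, 38, 45], dtype="int64"),
    "Years Experience": np.array([1, 12, 20], dtype="int64"),
    "ConvertedSalary": [52000.0, np.nan, 98000.0],
    "Gender": ["Male", "Female", "Female"],
})

# np.integer matches every NumPy integer dtype, so the result does not
# depend on what the platform treats as the default 'int'.
only_ints = toy.select_dtypes(include=[np.integer])
print(only_ints.columns.tolist())

# "number" selects all numeric columns, floats included.
only_numeric = toy.select_dtypes(include="number")
print(only_numeric.columns.tolist())
```

The exclude parameter works the same way when it is easier to name the types to leave out.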