Why generate features?

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Feature Engineering

Different types of data

  • Continuous: either integers (or whole numbers) or floats (decimals)
  • Categorical: one of a limited set of values, e.g. gender, country of birth
  • Ordinal: ranked values, often with no detail of distance between them
  • Boolean: True/False values
  • Datetime: dates and times
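
As a rough illustration of how the types above can show up in pandas, here is a minimal sketch. The toy DataFrame, its column names and its values are all hypothetical and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    # Continuous: floats (integers would also count)
    "salary": [52000.0, 61500.5, 48250.0],
    # Categorical: one of a limited set of values
    "country": pd.Categorical(["Ireland", "USA", "Ireland"]),
    # Ordinal: categories with an explicit ranking
    "satisfaction": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
    # Boolean: True/False values
    "hobby": [True, False, True],
    # Datetime: dates and times
    "survey_date": pd.to_datetime(["2018-02-28", "2018-06-28", "2018-06-06"]),
})

# One dtype per column
print(toy.dtypes)
```

Marking the ordinal column as an ordered categorical is one possible design choice; it lets pandas compare and sort the values by rank rather than alphabetically.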

Course structure

  • Chapter 1: Feature creation and extraction

  • Chapter 2: Engineering messy data

  • Chapter 3: Feature normalization

  • Chapter 4: Working with text features

Pandas

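import pandas as pd

# path_to_csv_file is a placeholder for the location of the survey CSV file
df = pd.read_csv(path_to_csv_file)

# Show the first five rows
print(df.head())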
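
Since `path_to_csv_file` above is only a placeholder, the following self-contained sketch reads a small CSV from an in-memory string instead. The column names and values here are hypothetical and stand in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content, invented for illustration only.
csv_text = (
    "respondent,score\n"
    "a,1.5\n"
    "b,2.0\n"
    "c,3.25\n"
)

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
print(df.head())
print(df.shape)
```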

Dataset

              SurveyDate  \
0    2018-02-28 20:20:00     
1    2018-06-28 13:26:00     
2    2018-06-06 03:37:00     
3    2018-05-09 01:06:00     
4    2018-04-12 22:41:00    

                              FormalEducation
0    Bachelor's degree (BA. BS. B.Eng.. etc.)
1    Bachelor's degree (BA. BS. B.Eng.. etc.)
2    Bachelor's degree (BA. BS. B.Eng.. etc.)
3    Some college/university study  ...
4    Bachelor's degree (BA. BS. B.Eng.. etc.)

Column names

print(df.columns)
Index(['SurveyDate', 'FormalEducation',
       'ConvertedSalary', 'Hobby', 'Country',
       'StackOverflowJobsRecommend', 'VersionControl', 
       'Age', 'Years Experience', 'Gender', 
       'RawSalary'], dtype='object')

Column types

print(df.dtypes)
SurveyDate                            object
FormalEducation                       object
ConvertedSalary                      float64
...
Years Experience                       int64
Gender                                object
RawSalary                             object
dtype: object

Selecting specific data types

only_ints = df.select_dtypes(include=['int'])
print(only_ints.columns)
Index(['Age', 'Years Experience'], dtype='object')
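
The same method can also select or exclude broader groups of types. The sketch below uses a hypothetical toy DataFrame (the values are invented, and only the column names echo the survey data) to show `include` and `exclude` with the generic `'number'` selector, which matches both integer and float columns.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": pd.Series([25, 31, 47], dtype="int64"),
    "ConvertedSalary": [52000.0, None, 48250.0],
    "Gender": ["Female", "Male", "Female"],
})

# Keep every numeric column, whether integer or float
only_numeric = toy.select_dtypes(include=["number"])

# Keep everything that is not numeric
no_numeric = toy.select_dtypes(exclude=["number"])

print(only_numeric.columns.tolist())
print(no_numeric.columns.tolist())
```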

Let's get going!
