Why generate features?

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Feature Engineering

Different types of data

  • Continuous: either integers (or whole numbers) or floats (decimals)
  • Categorical: one of a limited set of values, e.g. gender, country of birth
  • Ordinal: ranked values, often with no defined distance between ranks
  • Boolean: True/False values
  • Datetime: dates and times
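
As a rough illustration of how these five types can be represented in pandas, here is a minimal sketch. The toy DataFrame, its column names, and its values are hypothetical and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data: one column per data type listed above.
toy = pd.DataFrame({
    "salary": [52000.0, 61500.5, 48000.0],                      # continuous
    "country": ["Italy", "Ireland", "Italy"],                   # categorical
    "education": ["Bachelor", "Master", "Bachelor"],            # ordinal
    "hobby": [True, False, True],                               # Boolean
    "survey_date": ["2018-02-28", "2018-06-28", "2018-06-06"],  # datetime, still text here
})

# Categorical with no inherent order.
toy["country"] = toy["country"].astype("category")

# Ordinal: declare the ranking explicitly (Bachelor < Master).
toy["education"] = pd.Categorical(
    toy["education"], categories=["Bachelor", "Master"], ordered=True
)

# Datetime: parse the text into real timestamps.
toy["survey_date"] = pd.to_datetime(toy["survey_date"])

print(toy.dtypes)
```

Declaring the category order up front lets comparisons such as `toy["education"] > "Bachelor"` respect the ranking rather than alphabetical order.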

Course structure

  • Chapter 1: Feature creation and extraction

  • Chapter 2: Engineering messy data

  • Chapter 3: Feature normalization

  • Chapter 4: Working with text features

Pandas

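import pandas as pd
# path_to_csv_file is a placeholder for the location of the survey CSV
df = pd.read_csv(path_to_csv_file)
# Preview the first five rows
print(df.head())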

Dataset

              SurveyDate  \
0    2018-02-28 20:20:00     
1    2018-06-28 13:26:00     
2    2018-06-06 03:37:00     
3    2018-05-09 01:06:00     
4    2018-04-12 22:41:00    

                              FormalEducation
0    Bachelor's degree (BA. BS. B.Eng.. etc.)
1    Bachelor's degree (BA. BS. B.Eng.. etc.)
2    Bachelor's degree (BA. BS. B.Eng.. etc.)
3    Some college/university study  ...
4    Bachelor's degree (BA. BS. B.Eng.. etc.)

Column names

print(df.columns)
Index(['SurveyDate', 'FormalEducation',
       'ConvertedSalary', 'Hobby', 'Country',
       'StackOverflowJobsRecommend', 'VersionControl', 
       'Age', 'Years Experience', 'Gender', 
       'RawSalary'], dtype='object')

Column types

print(df.dtypes)
SurveyDate                            object
FormalEducation                       object
ConvertedSalary                      float64
...
Years Experience                       int64
Gender                                object
RawSalary                             object
dtype: object
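
The listing above reports SurveyDate as text rather than as a datetime. One possible way to convert such a column is sketched below, using the first two SurveyDate values from the dataset preview; the `survey` DataFrame itself is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Stand-in for df: the first two SurveyDate values from the preview, held as strings.
survey = pd.DataFrame({"SurveyDate": ["2018-02-28 20:20:00", "2018-06-28 13:26:00"]})

# Parse the strings into timestamps so date parts become accessible.
survey["SurveyDate"] = pd.to_datetime(survey["SurveyDate"])

print(survey["SurveyDate"].dtype)
print(survey["SurveyDate"].dt.month.tolist())
```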

Selecting specific data types

only_ints = df.select_dtypes(include=['int'])
print(only_ints.columns)
Index(['Age', 'Years Experience'], dtype='object')
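
The same method can also select or exclude broader groups of types. The sketch below runs on a small hypothetical DataFrame whose column names mirror the survey data but whose values are made up. It spells out `'int64'` explicitly, on the assumption that this matches the dtypes shown earlier, because the bare `'int'` alias can depend on platform and NumPy version.

```python
import pandas as pd

# Hypothetical values; only the column names echo the survey dataset.
toy = pd.DataFrame({
    "Age": [36, 25, 41],
    "Years Experience": [13, 3, 20],
    "ConvertedSalary": [70841.0, 21426.0, 41671.0],
    "Gender": ["Male", "Female", "Male"],
})
# Pin the integer width so the example behaves the same on every platform.
toy = toy.astype({"Age": "int64", "Years Experience": "int64"})

only_ints = toy.select_dtypes(include=["int64"])      # integer columns only
all_numeric = toy.select_dtypes(include=["number"])   # integers and floats
non_numeric = toy.select_dtypes(exclude=["number"])   # everything else

print(only_ints.columns.tolist())
print(all_numeric.columns.tolist())
print(non_numeric.columns.tolist())
```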

Let's get going!
