Dealing with missing values (I)

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Listwise deletion

      SurveyDate  ConvertedSalary  Hobby  ...
0  2/28/18 20:20              NaN    Yes  ...
1  6/28/18 13:26          70841.0    Yes  ...
2    6/6/18 3:37              NaN     No  ...
3    5/9/18 1:06          21426.0    Yes  ...
4  4/12/18 22:41          41671.0    Yes  ...

Listwise deletion in Python

# Drop all rows with at least one missing value
df.dropna(how='any')

Listwise deletion in Python

# Drop rows with missing values in a specific column
df.dropna(subset=['VersionControl'])
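A minimal sketch of both forms of listwise deletion on a toy DataFrame. The column names follow the slides, and the salary and hobby values are the five sample rows shown earlier; the VersionControl values are made up for illustration. Note that dropna returns a new DataFrame and leaves df unchanged unless the result is assigned.

```python
import numpy as np
import pandas as pd

# Toy data: VersionControl values are invented for this example
df = pd.DataFrame({
    "ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0],
    "Hobby": ["Yes", "Yes", "No", "Yes", "Yes"],
    "VersionControl": ["Git", None, "Git", "Git", None],
})

# Keep only rows with no missing value in any column
complete_rows = df.dropna(how="any")

# Keep only rows where VersionControl is present
has_version_control = df.dropna(subset=["VersionControl"])

print(len(df), len(complete_rows), len(has_version_control))
```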

Issues with deletion

  • It deletes valid data points
  • Relies on the values being missing at random
  • Reduces information
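One way to gauge these costs before deleting anything is to count missing values per column and the share of rows that listwise deletion would remove. The sketch below assumes a toy DataFrame; the VersionControl values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: VersionControl values are invented for this example
df = pd.DataFrame({
    "ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0],
    "Hobby": ["Yes", "Yes", "No", "Yes", "Yes"],
    "VersionControl": ["Git", None, "Git", "Git", None],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Share of rows that would be lost by dropping every incomplete row
rows_kept = len(df.dropna(how="any"))
fraction_lost = 1 - rows_kept / len(df)

print(missing_per_column)
print(fraction_lost)
```

In this toy case most rows contain a gap in one column or another, so deleting incomplete rows also throws away the valid values those rows hold in the other columns.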

Replacing with strings

# Replace missing values in a specific column
# with a given string
df['VersionControl'] = df['VersionControl'].fillna(
    value='None Given'
)
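A short sketch of the same replacement on a toy column, followed by a check that nothing is still missing. The sample values are made up for illustration. Assigning the result back to the column avoids relying on inplace=True on a column selection, which recent pandas versions warn about.

```python
import pandas as pd

# Toy column: values are invented for this example
df = pd.DataFrame({"VersionControl": ["Git", None, "Subversion", None]})

# Replace missing entries with an explicit category
df["VersionControl"] = df["VersionControl"].fillna("None Given")

# Count each category, including the new placeholder
counts = df["VersionControl"].value_counts()
print(counts)
```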

Recording missing values

# Record where the values are not missing
df['SalaryGiven'] = df['ConvertedSalary'].notnull()
# Drop a specific column (drop returns a new DataFrame)
df = df.drop(columns=['ConvertedSalary'])
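A minimal sketch of the two steps together, using the five salary values from the sample rows shown earlier. The indicator column keeps the information that a salary was or was not given, even after the original column is removed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0],
})

# True where a salary was provided, False where it was missing
df["SalaryGiven"] = df["ConvertedSalary"].notnull()

# Assign the result back so the column is actually removed
df = df.drop(columns=["ConvertedSalary"])

print(df)
```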

Practice time
