Why do missing values exist?

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

How gaps in data occur

  • Data not being collected properly
  • Collection and management errors
  • Data intentionally being omitted
  • Gaps created by transformations of the data
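
As an illustration of the last point, here is a minimal sketch of a transformation introducing gaps that were not in the raw data. The series `raw` and its values are invented for this example; `pd.to_numeric` with `errors="coerce"` is a standard pandas call that turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical raw salary strings; the values are invented for illustration.
raw = pd.Series(["50000", "n/a", "72000", "unknown"])

# Every entry is a string, so pandas sees no missing values yet.
print(raw.isnull().sum())

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising, so the conversion itself creates the gaps.
salary = pd.to_numeric(raw, errors="coerce")
print(salary.isnull().sum())
```

Comparing the two counts before and after a transformation is a cheap way to notice when a cleaning step is silently producing missing values.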

Why do we care?

  • Some models cannot work with missing data (Nulls/NaNs)
  • Missing data may be a sign of a wider data issue
  • Missing data can be a useful feature
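
One way missing data can become a feature is an indicator column that records whether a value was absent. The sketch below assumes a tiny invented DataFrame; the column name `RawSalary_missing` is hypothetical and chosen only for this example.

```python
import pandas as pd

# Tiny invented frame; column names are borrowed from the survey data.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", None],
    "RawSalary": ["$70,000", None, None, "$55,000"],
})

# 1 where the salary was not reported, 0 otherwise.
df["RawSalary_missing"] = df["RawSalary"].isnull().astype(int)
print(df)
```

A model can then learn from the fact that a respondent declined to answer, even after the original gap has been filled or dropped.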

Missing value discovery

df.info()  # prints the summary itself and returns None, so no print() is needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
 --  ------                      --------------  -----  
 0   SurveyDate                  999 non-null    object 
...  ...                         ...             ...
 8   StackOverflowJobsRecommend  487 non-null    float64
 9   VersionControl              999 non-null    object 
 10  Gender                      693 non-null    object 
 11  RawSalary                   665 non-null    object 
dtypes: float64(2), int64(2), object(8)
memory usage: 93.7+ KB

Finding missing values

print(df.isnull())
   StackOverflowJobsRecommend  VersionControl  ... \ 
0                        True           False  ...
1                       False           False  ...
2                       False           False  ...
3                        True           False  ...
4                       False           False  ...

   Gender  RawSalary
0   False       True
1   False      False
2    True       True
3   False      False
4   False      False

Counting missing values

print(df['StackOverflowJobsRecommend'].isnull().sum())
512
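
The same idea extends to every column at once. This is a minimal sketch on an invented five-row frame (the values are made up; only the column names come from the survey data), using `isnull().sum()` for counts and `isnull().mean()` for the share of missing rows.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the column names come from the survey data.
df = pd.DataFrame({
    "StackOverflowJobsRecommend": [np.nan, 7.0, 8.0, np.nan, 8.0],
    "VersionControl": ["Git", "Git", "Subversion", "Git", "Git"],
    "Gender": ["Male", "Male", None, "Male", "Female"],
})

# True counts as 1, so summing the boolean mask counts missing values per column.
missing_counts = df.isnull().sum()

# The mean of the same mask is the fraction of rows missing in each column.
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

The share is often easier to reason about than the raw count when deciding whether a column is worth keeping.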

Finding non-missing values

print(df.notnull())
   StackOverflowJobsRecommend  VersionControl  ... \
0                       False            True  ...
1                        True            True  ...
2                        True            True  ...
3                       False            True  ...
4                        True            True  ...

   Gender  RawSalary
0    True      False
1    True       True
2   False      False
3    True       True
4    True       True
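
A common use of the non-missing mask is to keep only the rows where a value is present. The sketch below assumes an invented five-row frame; the salary strings are made up and `has_salary` and `reported` are hypothetical names used only here.

```python
import pandas as pd

# Invented sample rows; only the column names come from the survey data.
df = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Male", "Female"],
    "RawSalary": [None, "$70,841.00", None, "$21,426.00", "$41,671.00"],
})

# Boolean mask: True where RawSalary holds a value.
has_salary = df["RawSalary"].notnull()

# Indexing with the mask keeps only the rows that reported a salary.
reported = df[has_salary]
print(len(reported))
```

Note that `notnull()` is simply the element-wise negation of `isnull()`, so either mask can be derived from the other with `~`.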

Go ahead and find missing values!
