Addressing missing data

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Why is missing data a problem?

  • Affects distributions
    • Missing heights of taller students
  • Less representative of the population
    • Certain groups disproportionately represented, e.g., lacking data on oldest students
  • Can result in drawing incorrect conclusions

Sample versus population height distribution, showing the sample has a lower maximum and mean
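
The effect described above can be sketched with synthetic numbers. The heights below are randomly generated for illustration (they are not the course data), and the 185 cm cut-off for "missing" tall students is an assumption.

```python
import numpy as np

# Hypothetical population of student heights in cm (synthetic, not course data)
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=170, scale=10, size=1_000)

# Assumption: heights above 185 cm were never recorded
sample = population[population <= 185]

# The sample has a lower mean and a lower maximum than the population
print(population.mean(), population.max())
print(sample.mean(), sample.max())
```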


Data professionals' job data

Column                Description                                        Data type
Working_Year          Year the data was obtained                         Float
Designation           Job title                                          String
Experience            Experience level, e.g., "Mid", "Senior"            String
Employment_Status     Type of employment contract, e.g., "FT", "PT"      String
Employee_Location     Country of employment                              String
Company_Size          Labels for company size, e.g., "S", "M", "L"       String
Remote_Working_Ratio  Percentage of time working remotely                Integer
Salary_USD            Salary in US dollars                               Float

Salary by experience level

Box plot of salaries by experience level using the clean dataset, with an upper limit near 600000 dollars

Box plot of salaries by experience level using a dataset with missing values, showing an upper limit near 450000 dollars


Checking for missing values

print(salaries.isna().sum())
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64
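
Counts are easier to judge as a share of all rows. A minimal sketch on a small made-up DataFrame (the `toy` name and its values are hypothetical), using `.isna().mean()` to get the fraction missing per column:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the salaries data
toy = pd.DataFrame({
    "Experience": ["Mid", None, "Senior", "Entry"],
    "Salary_USD": [70000.0, np.nan, np.nan, 50000.0],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```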

Strategies for addressing missing data

  • Drop missing values
    • 5% or less of total values
  • Impute mean, median, mode
    • Depends on distribution and context
  • Impute by sub-group
    • Different experience levels have different median salary
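
Why the choice of statistic depends on the distribution can be illustrated with a skewed, made-up salary column. The values are hypothetical; one very large salary pulls the mean far above the typical value, while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one missing value
pay = pd.Series([40000.0, 45000.0, 50000.0, 400000.0, np.nan])

mean_filled = pay.fillna(pay.mean())      # missing value becomes 133750.0
median_filled = pay.fillna(pay.median())  # missing value becomes 47500.0

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```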

Dropping missing values

threshold = len(salaries) * 0.05
print(threshold)
30

Dropping missing values

cols_to_drop = salaries.columns[salaries.isna().sum() <= threshold]

print(cols_to_drop)
Index(['Working_Year', 'Designation', 'Employee_Location',
       'Remote_Working_Ratio'],
      dtype='object')
salaries.dropna(subset=cols_to_drop, inplace=True)
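
The same steps can be tried end to end on a small made-up frame. The `toy` data is hypothetical, and this sketch returns a new DataFrame rather than using `inplace=True`. Columns with no missing values also pass the `<=` test, which is harmless because `dropna` finds nothing to drop in them.

```python
import numpy as np
import pandas as pd

# Hypothetical 20-row frame: one missing job title, ten missing salaries
toy = pd.DataFrame({
    "Designation": [None] + ["Analyst"] * 19,
    "Salary_USD": [50000.0] * 10 + [np.nan] * 10,
})

threshold = len(toy) * 0.05  # 5% of 20 rows = 1.0

# Columns whose missing count is at or below the threshold
cols_to_drop = toy.columns[toy.isna().sum() <= threshold]

# Drop only the rows that are missing a value in those columns
cleaned = toy.dropna(subset=cols_to_drop)

print(list(cols_to_drop))
print(cleaned.isna().sum())
```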

Imputing a summary statistic

cols_with_missing_values = salaries.columns[salaries.isna().sum() > 0]
print(cols_with_missing_values)
Index(['Experience', 'Employment_Status', 'Company_Size', 'Salary_USD'], 
    dtype='object')
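# Fill every column except the last (Salary_USD) with its most frequent value
for col in cols_with_missing_values[:-1]:
    # fillna returns a new Series, so assign the result back to the column
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])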
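
A small sketch of mode imputation on a single made-up column (the `toy` frame is hypothetical). `fillna` returns a new Series instead of changing the column, so the result is assigned back.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"Company_Size": ["M", "M", "L", None, "S", None]})

# mode() returns a Series because ties are possible; take the first entry
most_common = toy["Company_Size"].mode()[0]

# Assign the filled Series back to the column
toy["Company_Size"] = toy["Company_Size"].fillna(most_common)

print(toy["Company_Size"].tolist())
```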

Checking the remaining missing values

print(salaries.isna().sum())
Working_Year             0
Designation              0
Experience               0
Employment_Status        0
Employee_Location        0
Company_Size             0
Remote_Working_Ratio     0
Salary_USD              41

Imputing by sub-group

salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()

print(salaries_dict)
{'Entry': 55380.0, 'Executive': 135439.0, 'Mid': 74173.5, 'Senior': 128903.0}

Imputing by sub-group

salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
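
One possible alternative to building a dictionary, shown here on made-up data, is `groupby(...).transform("median")`, which returns each row's group median aligned to the original index. This is a sketch of a different route to the same result, not the approach used in the slides; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing salary per experience level
toy = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 60000.0, np.nan, 120000.0, np.nan, 140000.0],
})

# Median salary of each row's experience group, aligned row by row
group_median = toy.groupby("Experience")["Salary_USD"].transform("median")

# Fill each missing salary with the median of its own group
toy["Salary_USD"] = toy["Salary_USD"].fillna(group_median)

print(toy)
```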

No more missing values!

print(salaries.isna().sum())
Working_Year            0
Designation             0
Experience              0
Employment_Status       0
Employee_Location       0
Company_Size            0
Remote_Working_Ratio    0
Salary_USD              0
dtype: int64

Let's practice!
