Addressing missing data

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Why is missing data a problem?

  • Affects distributions
    • Missing heights of taller students
  • Less representative of the population
    • Certain groups disproportionately represented, e.g., lacking data on oldest students
  • Can result in drawing incorrect conclusions

Sample versus population height distribution, showing the sample has a lower maximum and mean
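
The effect on a distribution can be sketched with a few made-up numbers. The heights below are hypothetical, not course data; the example assumes that heights above 180 cm were simply never recorded in the sample.

```python
import pandas as pd

# Hypothetical heights in cm for a small "population" of students.
population = pd.Series([150, 155, 160, 165, 170, 175, 180, 185, 190, 195])

# Simulate a sample where the tallest students' heights are missing.
sample = population[population <= 180]

# Compare the mean and maximum of the full data and the incomplete sample.
print("population:", population.mean(), population.max())
print("sample:    ", sample.mean(), sample.max())
```

Both the mean and the maximum of the sample come out lower than those of the population, which is the distortion the figure describes.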

Data professionals' job data

Column                Description                                    Data type
Working_Year          Year the data was obtained                     Float
Designation           Job title                                      String
Experience            Experience level, e.g., "Mid", "Senior"        String
Employment_Status     Type of employment contract, e.g., "FT", "PT"  String
Employee_Location     Country of employment                          String
Company_Size          Labels for company size, e.g., "S", "M", "L"   String
Remote_Working_Ratio  Percentage of time working remotely            Integer
Salary_USD            Salary in US dollars                           Float

Salary by experience level

Box plot of salaries by experience level using the clean dataset, with an upper limit near 600000 dollars

Box plot of salaries by experience level using a dataset with missing values, showing an upper limit near 450000 dollars

Checking for missing values

print(salaries.isna().sum())
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64
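
Counts are easier to judge alongside the share of rows they represent. A minimal sketch, using a small hypothetical DataFrame named toy as a stand-in for salaries: taking the mean of the boolean isna() mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salaries DataFrame (not the course data).
toy = pd.DataFrame({
    "Experience": ["Mid", "Senior", None, "Entry"],
    "Salary_USD": [70000.0, np.nan, np.nan, 50000.0],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Fraction of missing values per column (mean of a True/False mask).
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```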

Strategies for addressing missing data

  • Drop missing values
    • 5% or less of total values
  • Impute mean, median, mode
    • Depends on distribution and context
  • Impute by sub-group
    • Different experience levels have different median salary
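
Why the choice of statistic depends on the distribution can be illustrated with a skewed example. The salary values here are invented for illustration; with one very large value present, the mean and the median give quite different fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed salaries with one missing value.
salary = pd.Series([50000.0, 55000.0, 60000.0, np.nan, 400000.0])

# The mean is pulled upward by the single very large salary.
mean_filled = salary.fillna(salary.mean())

# The median is less sensitive to that extreme value.
median_filled = salary.fillna(salary.median())

print(mean_filled[3], median_filled[3])
```

For skewed columns such as salaries, the median is often the safer default, though the right choice still depends on context.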

Dropping missing values

threshold = len(salaries) * 0.05
print(threshold)
30

Dropping missing values

cols_to_drop = salaries.columns[salaries.isna().sum() <= threshold]

print(cols_to_drop)
Index(['Working_Year', 'Designation', 'Employee_Location',
       'Remote_Working_Ratio'],
      dtype='object')
salaries.dropna(subset=cols_to_drop, inplace=True)
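
The same threshold logic can be tried on a small made-up DataFrame. The data and the names toy and sparse_cols are hypothetical; the sketch shows that rows are dropped only where the sparsely missing column is empty, while the column with many missing values is left for imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, one column with 1 missing value
# and one column with 5 missing values.
toy = pd.DataFrame({
    "Designation": ["Analyst"] * 19 + [None],
    "Salary_USD": [np.nan] * 5 + [60000.0] * 15,
})

# 5% of the number of rows.
threshold = len(toy) * 0.05

# Columns whose missing-value count is at or below the threshold.
sparse_cols = toy.columns[toy.isna().sum() <= threshold]

# Drop rows that have a missing value in any of those columns.
toy = toy.dropna(subset=sparse_cols)

print(list(sparse_cols))
print(toy.isna().sum())
```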

Imputing a summary statistic

cols_with_missing_values = salaries.columns[salaries.isna().sum() > 0]
print(cols_with_missing_values)
Index(['Experience', 'Employment_Status', 'Company_Size', 'Salary_USD'], 
    dtype='object')
for col in cols_with_missing_values[:-1]:
    # fillna() returns a new Series, so assign the result back to the column
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])
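
A self-contained sketch of mode imputation on hypothetical data follows. Note that fillna() returns a new Series rather than changing the column, so the result is assigned back; mode() returns a Series, and [0] picks its first value.

```python
import pandas as pd

# Hypothetical categorical columns with one missing value each.
toy = pd.DataFrame({
    "Company_Size": ["M", "M", None, "L", "S"],
    "Employment_Status": ["FT", None, "FT", "PT", "FT"],
})

for col in ["Company_Size", "Employment_Status"]:
    # Most frequent value in the column (missing values are ignored).
    most_common = toy[col].mode()[0]
    # Assign the filled Series back to the DataFrame.
    toy[col] = toy[col].fillna(most_common)

print(toy)
```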

Checking the remaining missing values

print(salaries.isna().sum())
Working_Year             0
Designation              0
Experience               0
Employment_Status        0
Employee_Location        0
Company_Size             0
Remote_Working_Ratio     0
Salary_USD              41
dtype: int64

Imputing by sub-group

salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()

print(salaries_dict)
{'Entry': 55380.0, 'Executive': 135439.0, 'Mid': 74173.5, 'Senior': 128903.0}

Imputing by sub-group

salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
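
An alternative to building a dictionary, shown here only as a sketch on hypothetical data, is to let groupby(...).transform("median") return a Series of group medians aligned to the original rows and pass that to fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing salary in each experience group.
toy = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 60000.0, np.nan, 120000.0, np.nan, 140000.0],
})

# Median salary of each row's experience group, aligned to toy's index.
group_median = toy.groupby("Experience")["Salary_USD"].transform("median")

# Fill each missing salary with the median of its own group.
toy["Salary_USD"] = toy["Salary_USD"].fillna(group_median)

print(toy)
```

Both approaches give the same fill values; the dictionary version makes the medians easy to inspect, while transform skips the intermediate step.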

No more missing values!

print(salaries.isna().sum())
Working_Year            0
Designation             0
Experience              0
Employment_Status       0
Employee_Location       0
Company_Size            0
Remote_Working_Ratio    0
Salary_USD              0
dtype: int64

Let's practice!
