Missing data: what can go wrong

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

What you will learn

After finishing this course, you will be able to:

Understand why missing data require special treatment.
Use statistical tests and visualization tools to detect patterns in missing data.
Perform imputation with a collection of statistical and machine learning models.
Incorporate uncertainty from imputation into your analyses and predictions, making them more robust.

Prerequisites

The course assumes you are comfortable with the following topics:

Basic data manipulations with dplyr and the pipe operator (%>%).
Linear and logistic regression models (lm(), glm()).
Basic probability concepts: random variables, distributions.

Missing data primer

“Obviously the best way to treat missing data is not to have them.”¹

Unfortunately, missing data are everywhere:

Nonresponse in surveys.
Technical issues with data-collecting equipment.
Joining data from different sources.
...

We have to stay watchful for missing data.

¹ Orchard, T., and M. A. Woodbury. 1972. “A Missing Information Principle: Theory and Applications.” In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1:697–715.

NHANES data

head(nhanes, 3)

  Age Gender Weight Height Diabetes TotChol Pulse PhysActive
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE

nhanes %>% is.na() %>% colSums()

       Age     Gender     Weight     Height   Diabetes    TotChol      Pulse PhysActive 
         0          0          9          8          1         85         32         26 
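
To see what the per-column count does, here is a minimal sketch on a tiny made-up data frame. The `toy` object and its values are assumptions for illustration only, not NHANES data. It uses base R nesting, which should give the same result as the piped form above.

```r
# Hypothetical toy data, NOT the NHANES data from the slides
toy <- data.frame(
  Age    = c(16, 17, 12, 45),
  Weight = c(73.2, NA, 57.7, NA),
  Pulse  = c(76, 74, NA, 80)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
n_complete
```

Counting complete rows as well as per-column gaps is useful because the two can tell different stories: here no column has more than two gaps, yet only one of four rows is fully observed.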

Linear regression with incomplete data

model_1 <- lm(Diabetes ~ Age + Weight, 
              data = nhanes)

Parts of summary(model_1):

Residual standard error: 0.08571 on 804 degrees of freedom
  (10 observations deleted due to missingness)

Adjusted R-squared:  0.005706 
F-statistic: 3.313 on 2 and 804 DF,  p-value: 0.03691

model_2 <- lm(Diabetes ~ Age + Weight + TotChol,
              data = nhanes)

Parts of summary(model_2):

Residual standard error: 0.08264 on 718 degrees of freedom
  (95 observations deleted due to missingness)

Adjusted R-squared:  0.008422 
F-statistic: 3.041 on 3 and 718 DF,  p-value: 0.02834
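
The two summaries are based on different numbers of rows. A minimal sketch of how one might check this, and refit both models on a common set of rows, is shown here. The `toy` data frame, its values, and the model names are hypothetical and stand in for the NHANES data.

```r
# Hypothetical toy data: x2 is missing in 3 of 10 rows
toy <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 8.0, 9.2, 9.9),
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  x2 = c(0.5, NA, 1.1, 0.9, NA, 1.4, 1.0, NA, 1.3, 0.7)
)

fit_a <- lm(y ~ x1, data = toy)       # all rows are complete for y and x1
fit_b <- lm(y ~ x1 + x2, data = toy)  # rows with missing x2 are dropped without a warning

nobs(fit_a)  # 10
nobs(fit_b)  # 7

# One possible remedy: restrict both models to rows complete in all variables used
common <- toy[complete.cases(toy[, c("y", "x1", "x2")]), ]
fit_a2 <- lm(y ~ x1, data = common)
fit_b2 <- lm(y ~ x1 + x2, data = common)

nobs(fit_a2) == nobs(fit_b2)  # TRUE
```

Refitting on common rows makes the fit statistics comparable, but it is still complete-case analysis, so it does nothing about potential bias from the dropped rows.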

Main takeaways

Statistical software sometimes drops missing data silently: lm() only mentions the deleted rows in a note inside summary().
As a result, models fitted to different subsets of rows (here 807 vs. 722 observations) may not be directly comparable.
Simply dropping all incomplete observations might lead to biased results.
Missing data, if present, have to be addressed appropriately.
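
A small sketch of how dropping incomplete observations can bias a result. The numbers are made up, and the missingness rule (the largest values go missing) is an assumption chosen to make the effect obvious.

```r
# Hypothetical variable with known true values
true_values <- 1:10

# Assume the three largest values were never recorded
observed <- true_values
observed[true_values > 7] <- NA

mean(true_values)             # 5.5
mean(observed, na.rm = TRUE)  # 4, because only 1..7 remain
```

Here `na.rm = TRUE` plays the role of dropping incomplete observations: the estimate runs without error but understates the true mean, because missingness depends on the value itself.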

Let's practice!
