Missing data: what can go wrong

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

What you will learn

After finishing this course, you will be able to:

  • Understand why missing data require special treatment.
  • Use statistical tests and visualization tools to detect patterns in missing data.
  • Perform imputation with a collection of statistical and machine learning models.
  • Incorporate uncertainty from imputation into your analyses and predictions, making them more robust.

Prerequisites

The course assumes you are comfortable with the following topics:

  • Basic data manipulations with dplyr and the pipe operator (%>%).
  • Linear and logistic regression models (lm(), glm()).
  • Basic probability concepts: random variables, distributions.

Missing data primer

"Obviously the best way to treat missing data is not to have them." (Orchard and Woodbury, 1972)

Unfortunately, missing data are everywhere:

  • Nonresponse in surveys.
  • Technical issues with data-collecting equipment.
  • Joining data from different sources.
  • ...

We have to stay watchful for missing data.
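
As a small illustration of how joining data from different sources creates missing values, here is a minimal sketch in base R. The `patients` and `labs` data frames and their columns are made up for this example and are not part of the course data.

```r
# Hypothetical toy data: four patients, but lab results for only two of them.
patients <- data.frame(id = c(1L, 2L, 3L, 4L), age = c(34, 51, 29, 62))
labs     <- data.frame(id = c(1L, 3L), chol = c(4.8, 5.6))

# A left join keeps every patient; unmatched rows get NA for chol.
joined <- merge(patients, labs, by = "id", all.x = TRUE)

# Count missing values per column, as done for nhanes in this chapter.
na_counts <- colSums(is.na(joined))
print(na_counts)
```

Neither source table contained a single `NA`; the missing values appear only after the join.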

1 Orchard, T., and M. A. Woodbury. 1972. “A Missing Information Principle: Theory and Applications.” In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1:697–715.

NHANES data

head(nhanes, 3)
  Age Gender Weight Height Diabetes TotChol Pulse PhysActive
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE
nhanes %>% is.na() %>% colSums()
       Age     Gender     Weight     Height   Diabetes    TotChol      Pulse PhysActive 
         0          0          9          8          1         85         32         26 

Linear regression with incomplete data

model_1 <- lm(Diabetes ~ Age + Weight, 
              data = nhanes)

Parts of summary(model_1):

Residual standard error: 0.08571 on 804 degrees of freedom
  (10 observations deleted due to missingness)

Adjusted R-squared:  0.005706 
F-statistic: 3.313 on 2 and 804 DF,  p-value: 0.03691

model_2 <- lm(Diabetes ~ Age + Weight + TotChol,
              data = nhanes)

Parts of summary(model_2):

Residual standard error: 0.08264 on 718 degrees of freedom
  (95 observations deleted due to missingness)

Adjusted R-squared:  0.008422 
F-statistic: 3.041 on 3 and 718 DF,  p-value: 0.02834
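
The two models above were fit to different numbers of rows (10 versus 95 observations dropped), so their fit statistics are not directly comparable. The sketch below shows one way to check this and to refit both models on the same rows. It is an illustration only: it uses the built-in `airquality` data as a stand-in, since `nhanes` is not bundled with base R, and the model formulas are chosen just for the example.

```r
# airquality ships with base R (datasets package); Ozone contains NAs,
# while Temp and Wind are complete.
data(airquality)

model_a <- lm(Temp ~ Wind, data = airquality)
model_b <- lm(Temp ~ Wind + Ozone, data = airquality)

# nobs() reports how many rows each fit actually used.
n_a <- nobs(model_a)
n_b <- nobs(model_b)
cat("Rows used:", n_a, "vs", n_b, "\n")

# Keep only rows that are complete for every variable in the larger model,
# then refit both models on that common subset.
vars   <- c("Temp", "Wind", "Ozone")
common <- airquality[complete.cases(airquality[, vars]), ]

model_a_common <- lm(Temp ~ Wind, data = common)
model_b_common <- lm(Temp ~ Wind + Ozone, data = common)
```

Refitting on the common subset makes the two models comparable, but it still discards incomplete rows, which can bias the results.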

Main takeaways

  • Missing data are sometimes silently dropped by statistical software.
  • As a result, different models may be fit to different subsets of the data, which can make them impossible to compare.
  • Simply dropping all incomplete observations might lead to biased results.
  • Missing data, if present, have to be addressed appropriately.
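
To make the first takeaway concrete: `lm()` drops incomplete rows by default, but it can be told to stop instead. The following is a minimal sketch using the built-in `airquality` data as a stand-in for `nhanes`; the formula is chosen only for illustration.

```r
data(airquality)

# na.action = na.fail makes lm() raise an error when any model variable
# contains NA, instead of silently dropping the incomplete rows.
result <- tryCatch(
  lm(Temp ~ Wind + Ozone, data = airquality, na.action = na.fail),
  error = function(e) conditionMessage(e)
)

# result holds the error message here, not a fitted model.
print(result)
```

Failing loudly like this forces an explicit decision about how to handle the missing values before any model is fit.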

Let's practice!
