Putting it all together

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Case study: civil liberties in Africa

head(africa)
  year      country gdp_pc  infl trade    civlib population
1 1972 Burkina Faso    377 -2.92 29.69 0.5000000    5848380
2 1973 Burkina Faso    376  7.60 31.31 0.5000000    5958700
3 1974 Burkina Faso    393  8.72 35.22 0.3333333    6075700
4 1975 Burkina Faso    416 18.76 40.11 0.3333333    6202000
5 1976 Burkina Faso    435 -8.40 37.76 0.5000000    6341030
6 1977 Burkina Faso    448 29.99 41.11 0.6666667    6486870
1 Data source: https://scholar.harvard.edu/rbates/data

Modeling incomplete data

Goal: investigate the relationship between civil liberties (civlib) and GDP per capita (gdp_pc).

  1. Visualize incomplete data.
    • Which variables are missing?
    • What might be the missing data mechanisms?
  2. Impute missing data and inspect imputation quality.
  3. Run a model on imputed data, accounting for imputation uncertainty.

What you will need

  • aggr()
  • spineMiss()
  • mice() - with() - pool()
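
The three steps could map onto these functions roughly as follows. This is a minimal sketch rather than the case-study solution: it assumes the VIM and mice packages are installed, and it uses the small nhanes data set bundled with mice (columns age, bmi, hyp, chl) as a stand-in, since africa is not shipped with either package. The model formula bmi ~ chl and the seed are arbitrary choices for illustration.

```r
library(VIM)   # aggr(), spineMiss()
library(mice)  # mice(), with(), pool()

# Stand-in data (assumption): mice's bundled nhanes, not the africa data
dat <- mice::nhanes

# Step 1: visualize incomplete data
# aggr() shows how much is missing per variable and in which combinations
aggr(dat, numbers = TRUE)
# spineMiss() shows the share of missing chl values across levels of age;
# a share that varies with age hints that missingness depends on age
spineMiss(dat[, c("age", "chl")], interactive = FALSE)

# Step 2: create five imputed data sets with predictive mean matching
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Step 3: fit the same model on each imputed data set, then pool the
# results so the standard errors reflect imputation uncertainty
fit <- with(imp, lm(bmi ~ chl))
pooled <- pool(fit)
summary(pooled)
```

For the case study, the same calls would take africa in place of dat and a model relating gdp_pc to civlib in place of the illustrative formula.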

Assessing imputation quality with MICE

  • mice() produces multiple imputed data sets.
  • Visualizing each of them with VIM's functions could be cumbersome.
  • The mice package offers its own plots that automatically handle multiple data sets.
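library(mice)  # mice() and the stripplot() method for its output

# Create five imputed data sets using predictive mean matching (pmm)
nhanes_multiimp <- mice(nhanes, m = 5, defaultMethod = "pmm")
# One panel per data set (.imp); imputed points are drawn in a different color
stripplot(nhanes_multiimp,
          Weight ~ Height | .imp,
          pch = 20, cex = 2)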

Strip plot

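A grid of six scatter plots of Weight against Height: one panel for the observed data (.imp = 0) and one for each of the five imputed data sets. The panels for the imputed data sets highlight the imputed values in color. The imputed values lie close to the observed ones, so only the color tells them apart.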
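
To see where the panels in such a plot come from, it can help to look at the .imp index directly. The sketch below assumes the mice package is installed and again uses mice's bundled nhanes data (not the course's data set with Height and Weight); the seed is arbitrary. It stacks the original data and the five imputed data sets into one long data frame, where .imp = 0 marks the original, incomplete rows.

```r
library(mice)

# Five imputed data sets from mice's bundled nhanes data (assumption)
imp <- mice(mice::nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Stack the original data (.imp = 0) and all imputed data sets (.imp = 1..5)
long <- complete(imp, action = "long", include = TRUE)

# Number of rows per value of .imp: one block per panel of the strip plot
table(long$.imp)

# A single completed data set can also be extracted on its own
first_completed <- complete(imp, 1)
head(first_completed)
```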


Let's put what you've learned into practice!
