Model-based imputation approach

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Model-based imputation

  • Impute each variable with a different statistical model.
  • Lets us account for known relationships in the data.

Model-based imputation procedure

  • Loop over variables.
  • For each variable, create a model that explains it.
  • Use the model to predict missing values.
  • Repeat the loop over variables, re-imputing only the locations where data were originally missing.
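
The procedure above can be sketched in base R. This is only an illustration under assumptions that are not in the slides: a toy data frame `df`, column means as starting values, a linear model for every variable, and a fixed number of passes.

```r
# Toy data (hypothetical): three related numeric variables with gaps in x and y
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50, sd = 0.1)
df$z <- df$x - df$y + rnorm(50, sd = 0.1)
df$x[c(3, 10)] <- NA
df$y[c(5, 20)] <- NA

# Remember where data were originally missing
miss <- lapply(df, is.na)

# Assumption: start from column means so every model sees complete predictors
imp <- df
for (v in names(imp)) {
  imp[[v]][miss[[v]]] <- mean(df[[v]], na.rm = TRUE)
}

# Loop over variables several times
for (pass in 1:5) {
  for (v in names(imp)) {
    if (!any(miss[[v]])) next
    # Model that explains v with all other variables,
    # fitted on rows where v was actually observed
    fit <- lm(reformulate(setdiff(names(imp), v), response = v),
              data = imp[!miss[[v]], ])
    # Overwrite only the originally missing locations
    imp[[v]][miss[[v]]] <- predict(fit, newdata = imp[miss[[v]], ])
  }
}
```

Observed values are never touched; only the entries flagged in `miss` change from pass to pass.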

Model-based imputation step by step

A data frame with four variables (A, B, C and D) and five rows, filled with dummy data. Two of the variables (A and C) have two missing values each, all in different rows.

Model-based imputation step by step

A data frame with four variables (A, B, C and D) and five rows, filled with dummy data. The missing values in A have been imputed.

  1. Predict missing values in A.

Model-based imputation step by step

A data frame with four variables (A, B, C and D) and five rows, filled with dummy data. Missing values in both A and C have been imputed.

  1. Predict missing values in A.
  2. Treat data imputed in A as observed and predict missing values in C.

Model-based imputation step by step

A data frame with four variables (A, B, C and D) and five rows, filled with dummy data. Missing values in C have been imputed.

  1. Predict missing values in A.
  2. Treat data imputed in A as observed and predict missing values in C.
  3. Treat data imputed in C as observed and predict A again where it was originally missing.
  4. Continue until convergence.
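
The four steps can be traced on a small example with missing values in A and C only. The sketch below rests on assumptions: the data are simulated, there are 30 rows rather than five so the regressions are well determined, and "convergence" is taken to mean that imputed values change by less than a hypothetical tolerance of 1e-6 between iterations.

```r
# Simulated data (hypothetical): A and C are missing in different rows
set.seed(42)
n <- 30
df <- data.frame(B = rnorm(n), D = rnorm(n))
df$A <- 1 + 0.5 * df$B + rnorm(n, sd = 0.2)
df$C <- 2 - df$A + 0.3 * df$D + rnorm(n, sd = 0.2)
df <- df[, c("A", "B", "C", "D")]

miss_A <- seq_len(n) %in% c(2, 7)
miss_C <- seq_len(n) %in% c(4, 9)
df$A[miss_A] <- NA
df$C[miss_C] <- NA

for (iter in 1:50) {
  prev <- df
  # Steps 1 and 3: predict A where it was originally missing
  # (in the first pass lm() drops the rows where C is still NA)
  fit_A <- lm(A ~ B + C + D, data = df[!miss_A, ])
  df$A[miss_A] <- predict(fit_A, newdata = df[miss_A, ])
  # Step 2: treat imputed A as observed and predict C
  fit_C <- lm(C ~ A + B + D, data = df[!miss_C, ])
  df$C[miss_C] <- predict(fit_C, newdata = df[miss_C, ])
  # Step 4: stop once the imputed values no longer change
  change <- max(abs(df$A[miss_A] - prev$A[miss_A]),
                abs(df$C[miss_C] - prev$C[miss_C]))
  if (!is.na(change) && change < 1e-6) break
}
```

In the first iteration `change` is `NA` because there is no previous imputation to compare against, so the stopping rule only applies from the second iteration onwards.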

How to choose the model

The model for each variable depends on the type of this variable:

  • Continuous variables - linear regression
  • Binary variables - logistic regression
  • Categorical variables - multinomial logistic regression
  • Count variables - Poisson regression
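
As an illustration of this mapping, each model type can be fitted in R as shown below. The data are simulated and the variable names are hypothetical; `nnet::multinom()` is one possible choice for multinomial logistic regression and assumes the nnet package (shipped with standard R installations) is available.

```r
# Simulated data (hypothetical) with one predictor and four target types
set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n))
d$cont  <- 1 + 2 * d$x + rnorm(n)                   # continuous
d$bin   <- rbinom(n, 1, plogis(d$x))                # binary
d$count <- rpois(n, exp(0.3 * d$x))                 # count
d$cat   <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical

fit_cont  <- lm(cont ~ x, data = d)                        # linear regression
fit_bin   <- glm(bin ~ x, family = binomial, data = d)     # logistic regression
fit_count <- glm(count ~ x, family = poisson, data = d)    # Poisson regression
fit_cat   <- nnet::multinom(cat ~ x, data = d, trace = FALSE)  # multinomial logistic

# Predictions for a new observation, on the scale of each variable
new <- data.frame(x = 0.5)
p_cont  <- predict(fit_cont, new)
p_bin   <- predict(fit_bin, new, type = "response")    # probability of a 1
p_count <- predict(fit_count, new, type = "response")  # expected count
p_cat   <- predict(fit_cat, new, type = "class")       # most likely category
```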

Single linear regression imputation

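Impute Height and Weight in nhanes with a linear model:

library(simputation)
library(magrittr)  # provides the %>% pipe
nhanes_imp <- impute_lm(nhanes, Height + Weight ~ .)

Check whether they were indeed imputed by counting the missing values per column:

nhanes_imp %>%
  is.na() %>%
  colSums()
       Age     Gender     Weight     Height   Diabetes    TotChol      Pulse PhysActive 
         0          0         32         30          1         85         32         26 

Height and Weight still contain missing values after this single pass.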
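
The counts show that Height and Weight are not fully imputed. One likely explanation, stated here as an assumption rather than something the slides spell out, is that a linear model cannot produce a prediction for a row whose predictors are themselves missing. The base R sketch below shows this behaviour on a hypothetical toy data frame.

```r
# Toy data (hypothetical): y is missing in rows 3 and 5, x is missing in row 5
d <- data.frame(y = c(1, 2, NA, 4, NA),
                x = c(1, 2, 3, 4, NA))

# lm() silently drops incomplete rows when fitting
fit <- lm(y ~ x, data = d)

# Predict y where it is missing: row 3 has an observed x, row 5 does not
pred <- predict(fit, newdata = d[is.na(d$y), ])
```

Here `pred` holds a value for row 3 but `NA` for row 5, which is why the missing predictors need starting values before the models are applied.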

Linear regression imputation in practice

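Initialize missing values with hotdeck and save missing locations:

library(VIM)  # provides hotdeck()
nhanes_imp <- hotdeck(nhanes)
# hotdeck() adds a logical <variable>_imp column flagging each imputed value
missing_height <- nhanes_imp$Height_imp
missing_weight <- nhanes_imp$Weight_imp

Iterate over Height and Weight 5 times, imputing them at originally missing locations:

for (i in 1:5) {
  # Reset Height to NA where it was originally missing, then re-impute it
  nhanes_imp$Height[missing_height] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Height ~ Age + Gender + Weight)
  # Same for Weight, now using the freshly imputed Height as a predictor
  nhanes_imp$Weight[missing_weight] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Weight ~ Age + Gender + Height)
}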

Detecting convergence

diff_height <- c()
diff_weight <- c()
for (i in 1:5) {
  # Keep the previous iteration's values for comparison
  prev_iter <- nhanes_imp
  nhanes_imp$Height[missing_height] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Height ~ Age + Gender + Weight)
  nhanes_imp$Weight[missing_weight] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Weight ~ Age + Gender + Height)
  # Record how much each variable changed; mapc() is not defined on these slides
  diff_height <- c(diff_height, mapc(prev_iter$Height, nhanes_imp$Height))
  diff_weight <- c(diff_weight, mapc(prev_iter$Weight, nhanes_imp$Weight))
}
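
The loop relies on a helper `mapc()` that the slides do not define. Judging by the name and by the plot's y-axis (mean absolute percentage change), a helper along these lines would fit. This definition is an assumption, not necessarily the one used in the course.

```r
# Assumed helper: mean absolute percentage change from vector a to vector b
mapc <- function(a, b) {
  mean(abs(b - a) / abs(a), na.rm = TRUE)
}

# Changes of 10% and 5% average to 7.5%
mapc(c(100, 200), c(110, 190))  # 0.075
```

Because observed values are identical across iterations, they contribute zero change, so the measure is driven only by the imputed entries. Values of zero in `a` would lead to division by zero and need separate handling.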

Detecting convergence

A line plot with two lines on it, one for Pulse and one for TotChol variables. The x-axis shows the number of iterations, while the y-axis shows the mean absolute percentage change in imputed values across iterations. After the first iteration both variables see some change. From iteration 2 onwards, they do not change anymore.

Let's practice linear regression imputation!
