Replicating data variability

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Variability in imputed data

A margin plot, which is a scatter plot of "Height" vs "Weight", where values imputed in either of the two variables are highlighted in a different color.

  • No variability in imputed data.
  • We would like the imputation to replicate the variability of observed data.
  • In model-based imputation, the same values of predictors result in the same imputed value.
  • Solution: drawing from conditional distributions.
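
The third point above can be illustrated with a minimal sketch. It uses simulated data rather than the course's dataset; the variable names (height, weight) and all numbers are assumptions chosen for illustration only.

```r
set.seed(42)

# Simulated data (hypothetical, not the course's dataset)
n <- 200
height <- rnorm(n, mean = 170, sd = 10)
weight <- 0.9 * height - 85 + rnorm(n, sd = 8)
df <- data.frame(height = height, weight = weight)

# Remove 30 weights at random to create missing values
missing_weight <- sample(n, 30)
df$weight[missing_weight] <- NA

# lm() drops incomplete rows by default, so the fit uses observed data only
model <- lm(weight ~ height, data = df)

# Model-based imputation: every missing weight gets the value on the line
imputed <- predict(model, newdata = df[missing_weight, ])

# Two rows with the same height receive exactly the same imputed weight
same_inputs <- predict(model, newdata = data.frame(height = c(175, 175)))
print(same_inputs)

# Spread of the imputed values next to the spread of the observed ones
print(c(imputed = sd(imputed), observed = sd(df$weight, na.rm = TRUE)))
```

All imputed points lie exactly on the fitted line, so they carry none of the scatter around it that the observed points show.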

What is a prediction

Most statistical models estimate the conditional distribution of the response variable:

$p(y|X)$

To make a single prediction, the conditional distribution is summarized:

  • Linear regression: expected value of the conditional distribution.
  • Logistic regression: class with the highest probability.

Instead, we can draw from these distributions to increase variability.
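
For linear regression, one way to sketch the difference is shown below. It rests on simplifying assumptions: the data are simulated, the conditional distribution is taken to be normal with the fitted mean and the residual standard error, and uncertainty in the estimated coefficients is ignored.

```r
set.seed(1)

# Simulated data (hypothetical)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, sd = 1.5)
model <- lm(y ~ x)

# Five new rows that share the same predictor value
new_x <- data.frame(x = rep(0.5, 5))

# Summarizing p(y | X) by its expected value: five identical predictions
cond_mean <- predict(model, newdata = new_x)

# Drawing from p(y | X) instead, approximated here as
# Normal(mean = fitted value, sd = residual standard error)
resid_sd <- summary(model)$sigma
draws <- rnorm(length(cond_mean), mean = cond_mean, sd = resid_sd)

print(cond_mean)
print(draws)
```

The point predictions are all equal, while the draws scatter around them with roughly the spread of the model's residuals.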

Drawing from conditional distributions

A plot showing a probability density function of a normal distribution. The mean of 25 is highlighted.

A table with four columns: the predicted probability from logistic regression (0.7 for all rows), a boolean indicating whether this probability is larger than 0.5 (TRUE for all rows), the imputed value based on the threshold (1 for all rows), and the imputed value drawn from the conditional distribution (1 for most rows, but 0 for some).
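
A table of this shape can be reproduced with a few lines of base R. This is only a sketch: the number of rows, the seed, and the column names are assumptions, and the drawn column depends on the random seed.

```r
set.seed(123)

# Same predicted probability for every row, as in the table described above
prob <- rep(0.7, 10)

# Thresholding: every row is above 0.5, so every imputed value is 1
above_threshold <- prob > 0.5
imputed_threshold <- as.integer(above_threshold)

# Drawing: each row is a Bernoulli trial with success probability 0.7,
# so in the long run about 30% of the imputed values come out as 0
imputed_draw <- rbinom(length(prob), size = 1, prob = prob)

print(data.frame(prob, above_threshold, imputed_threshold, imputed_draw))
```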

Logistic regression imputation

Task: impute PhysActive from nhanes data with logistic regression.

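library(VIM)  # provides hotdeck()
# Initialize with hot-deck imputation so that all predictors are complete
nhanes_imp <- hotdeck(nhanes)
# Remember which PhysActive values were missing in the original data
missing_physactive <- is.na(nhanes$PhysActive)
# Model PhysActive using the completed predictors
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
# Predicted probability of PhysActive = 1 for every row
preds <- predict(logreg_model, type = "response")
# Summarize each probability as its most likely class
preds <- ifelse(preds >= 0.5, 1, 0)
# Overwrite only the originally missing entries
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]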

Logistic regression imputation

Variability of imputed data:

table(preds[missing_physactive])
 1 
26

Variability of observed PhysActive data:

table(nhanes$PhysActive)
  0   1 
181 610

Drawing from class probabilities

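# Initialize with hot-deck imputation and flag the originally missing values
nhanes_imp <- hotdeck(nhanes)
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
# Predicted probability of PhysActive = 1 for every row
preds <- predict(logreg_model, type = "response")
# Draw each class from a binomial distribution with one trial and the
# predicted probability of success, instead of thresholding at 0.5
preds <- rbinom(length(preds), size = 1, prob = preds)
# Overwrite only the originally missing entries
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]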

Drawing from class probabilities

Variability of imputed data:

table(preds[missing_physactive])
0  1 
5 21

Variability of observed PhysActive data:

table(nhanes$PhysActive)
  0   1 
181 610
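
To compare the two approaches on one scale, the counts printed in the tables can be converted to class shares. The sketch below hard-codes those counts; the object and row names are illustrative only.

```r
# Counts of classes 0 and 1, copied from the tables in these slides
counts <- rbind(
  observed          = c(`0` = 181, `1` = 610),
  imputed_threshold = c(`0` = 0,   `1` = 26),
  imputed_draw      = c(`0` = 5,   `1` = 21)
)

# Row-wise proportions: the share of each class within each group
shares <- round(prop.table(counts, margin = 1), 3)
print(shares)
#                       0     1
# observed          0.229 0.771
# imputed_threshold 0.000 1.000
# imputed_draw      0.192 0.808
```

With thresholding, every imputed value is 1; with drawing, the share of 1s among imputed values (about 81%) is much closer to the share in the observed data (about 77%).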

Let's practice replicating data variability!
