Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
Most statistical models estimate the conditional distribution of the response variable:
$p(y|X)$
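To make a single prediction, the conditional distribution is summarized, for example by its expected value (linear regression) or by its most probable class (logistic regression).
Instead, we can draw from these distributions to increase the variability of the imputed values.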
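The difference between summarizing and drawing can be sketched on simulated data. This is a minimal illustration, not the course's code: the data frame `toy`, its columns `x` and `y`, and the simulation coefficients are all made up for the example.

```r
set.seed(42)  # fixed seed so the random draws are reproducible

# Simulated toy data (assumption: NOT the nhanes data). The binary
# outcome is mostly 1 and depends weakly on a single predictor.
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.2 + 0.5 * x))
toy <- data.frame(x = x, y = y)

# Logistic regression estimates P(y = 1 | x) for every row.
model <- glm(y ~ x, data = toy, family = binomial)
probs <- predict(model, type = "response")

# Summarizing: replace each probability with its most likely class.
summarized <- ifelse(probs >= 0.5, 1, 0)

# Drawing: sample each value from a Bernoulli distribution with
# success probability equal to the predicted probability.
drawn <- rbinom(n, size = 1, prob = probs)

# Compare class counts of both approaches with the simulated outcome.
table(summarized)
table(drawn)
table(toy$y)
```

Comparing the three tables shows how far each approach moves the class balance away from the simulated outcome; thresholding can only ever return the more probable class for a given row.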
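Task: impute PhysActive from the nhanes data with logistic regression.
library(VIM)  # provides hotdeck()
nhanes_imp <- hotdeck(nhanes)  # initial hot-deck fill so the model sees no NAs
missing_physactive <- is.na(nhanes$PhysActive)  # rows whose PhysActive must be imputed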
# Model the probability of PhysActive given the predictors
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
preds <- predict(logreg_model, type = "response")  # predicted P(PhysActive = 1) for every row
preds <- ifelse(preds >= 0.5, 1, 0)  # summarize each probability as its most likely class
# Overwrite only the originally missing entries with the model's predictions
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]
Variability of imputed data:
table(preds[missing_physactive])
1
26
Variability of observed PhysActive data:
table(nhanes$PhysActive)
0 1
181 610
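Instead of thresholding at 0.5, draw each imputed value from a binomial distribution with the predicted probability:
nhanes_imp <- hotdeck(nhanes)
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
preds <- predict(logreg_model, type = "response")
# One Bernoulli draw per row, using that row's predicted probability
preds <- rbinom(length(preds), size = 1, prob = preds)
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]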
Variability of imputed data:
table(preds[missing_physactive])
0 1
5 21
Variability of observed PhysActive data:
table(nhanes$PhysActive)
0 1
181 610
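Because the imputed and observed groups differ greatly in size, the counts are easier to compare as shares. The sketch below simply re-enters the counts printed in the tables above; the helper `share()` and the vector names are hypothetical, introduced only for this illustration.

```r
# Counts copied from the table() outputs shown above.
observed    <- c("0" = 181, "1" = 610)
thresholded <- c("0" = 0,   "1" = 26)   # imputation with the 0.5 cutoff
drawn       <- c("0" = 5,   "1" = 21)   # imputation with rbinom() draws

# Hypothetical helper: convert class counts to rounded class shares.
share <- function(counts) round(counts / sum(counts), 3)

shares <- rbind(observed    = share(observed),
                thresholded = share(thresholded),
                drawn       = share(drawn))
shares
# observed:    0.229 and 0.771
# thresholded: 0.000 and 1.000
# drawn:       0.192 and 0.808
```

With only 26 imputed values the drawn shares are noisy, but they land much closer to the observed class balance than the thresholded imputations, which contain no zeros at all.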