Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer

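Most statistical models estimate the conditional distribution of the response variable:
$p(y|X)$
To make a single prediction, this distribution is summarized, for example by its mean in linear regression or by the most probable class in logistic regression.
Instead, you can draw from these distributions to increase the variability of the imputed values.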
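The difference between summarizing and drawing can be sketched on simulated data. This is a minimal illustration, not part of the original slides: the variables `x` and `y`, the coefficients, and the choice of a linear model with normally distributed errors are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical data: y depends linearly on x plus normal noise
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)

# Summarizing: predict() returns the estimated conditional mean of y given x
mean_pred <- predict(fit)

# Drawing: sample from a normal distribution centered on that mean,
# using the residual standard deviation as the spread
drawn_pred <- rnorm(n, mean = mean_pred, sd = sigma(fit))

# Compare the variability of the observed and the two predicted versions
c(observed = var(y), summarized = var(mean_pred), drawn = var(drawn_pred))
```

The summarized predictions lie exactly on the regression line, so they vary less than the observed response. The drawn values add back the spread around that line.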


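Task: impute PhysActive from nhanes with logistic regression.
library(VIM)  # provides hotdeck()
# Initialize missing values with hot-deck imputation so the model can be fit
nhanes_imp <- hotdeck(nhanes)
# Remember which PhysActive values were originally missing
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
# Predicted probabilities, summarized to the most probable class
preds <- predict(logreg_model, type = "response")
preds <- ifelse(preds >= 0.5, 1, 0)
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]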
Variability of imputed data:
table(preds[missing_physactive])
 1
26
Variability of observed PhysActive data:
table(nhanes$PhysActive)
  0   1
181 610
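Instead of thresholding at 0.5, draw each class from a binomial distribution with the predicted probability:
nhanes_imp <- hotdeck(nhanes)
missing_physactive <- is.na(nhanes$PhysActive)
logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                    data = nhanes_imp, family = binomial)
preds <- predict(logreg_model, type = "response")
# Draw 0 or 1 for each observation with its predicted probability of being 1
preds <- rbinom(length(preds), size = 1, prob = preds)
nhanes_imp[missing_physactive, "PhysActive"] <- preds[missing_physactive]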
Variability of imputed data:
table(preds[missing_physactive])
 0  1
 5 21
Variability of observed PhysActive data:
table(nhanes$PhysActive)
  0   1
181 610
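The contrast between the two approaches can be reproduced without the nhanes data. The sketch below uses simulated data. The variable names `age` and `active` and the data-generating coefficients are hypothetical, chosen so that the majority class dominates, as it does in the observed PhysActive counts above.

```r
set.seed(42)

# Hypothetical data: most people are active, with a weak dependence on age
n <- 500
age <- runif(n, min = 20, max = 80)
active <- rbinom(n, size = 1, prob = plogis(2 - 0.015 * age))

model <- glm(active ~ age, family = binomial)
probs <- predict(model, type = "response")

# Summarizing: pick the most probable class for every observation
thresholded <- ifelse(probs >= 0.5, 1, 0)

# Drawing: sample each class with its predicted probability
drawn <- rbinom(n, size = 1, prob = probs)

table(thresholded)
table(drawn)

# Share of the majority class in the observed and the drawn values
c(observed = mean(active), drawn = mean(drawn))
```

When every predicted probability is above 0.5, thresholding assigns the same class to all observations, whereas drawing keeps both classes in roughly their observed proportions.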