Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
Pros:
Allows for sensitivity analysis towards MNAR data.
Cons:
Impute nhanes 20 times:
library(mice)
nhanes_multiimp <- mice(nhanes, m = 20)
Fit a linear regression model to each imputed data set:
lm_multiimp <- with(nhanes_multiimp, lm(Weight ~ Height + TotChol + PhysActive))
Pool regression results:
lm_pooled <- pool(lm_multiimp)
summary(lm_pooled, conf.int = TRUE, conf.level = 0.95)
            estimate std.error statistic      df p.value    2.5 %   97.5 %
(Intercept) -122.964    10.933   -11.247 735.389   0.000 -144.428 -101.500
Height         1.086     0.060    18.158 796.106   0.000    0.968    1.203
TotChol        2.653     0.884     3.003 305.460   0.003    0.915    4.392
PhysActive    -1.746     1.422    -1.228 733.957   0.220   -4.536    1.045
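pool() combines the per-data-set results using Rubin's rules. The base-R sketch below applies those rules to a single coefficient to show where the pooled standard error comes from. The helper name pool_rubin and the input numbers are made up for illustration, and the degrees-of-freedom adjustment that pool() also reports is left out.

```r
# Hypothetical helper (not part of mice): pool one coefficient across
# m imputed data sets using Rubin's rules.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)        # pooled point estimate
  u_bar <- mean(std_errors^2)     # within-imputation variance
  b <- var(estimates)             # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b
  c(estimate = q_bar, std.error = sqrt(total_var))
}

# Made-up estimates and standard errors from m = 4 imputed data sets
est <- c(1.0, 1.2, 0.8, 1.0)
se  <- c(0.10, 0.10, 0.10, 0.10)
pooled <- pool_rubin(est, se)
print(pooled)
```

The pooled standard error is larger than any single-data-set standard error because it also counts how much the estimates vary between imputations.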
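mice() takes an argument defaultMethod: a vector of 4 strings, specifying methods for, in order: numeric variables, factors with 2 levels, unordered factors with more than 2 levels, and ordered factors with more than 2 levels.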
nhanes_multiimp <- mice(nhanes, m = 20,
                        defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
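To see which method each variable ends up with, the method element of the returned object can be inspected. The sketch below assumes the small nhanes2 example data shipped with mice (columns age, bmi, hyp, chl), which is not the nhanes data used in these slides. Swapping in "norm" for numeric variables is only an illustration, not a recommendation.

```r
library(mice)

# nhanes2 is the example data set shipped with mice, not the course data.
# maxit = 0 sets up the imputation model without running any iterations.
setup_default <- mice(nhanes2, maxit = 0, printFlag = FALSE)
setup_norm <- mice(nhanes2, maxit = 0, printFlag = FALSE,
                   defaultMethod = c("norm", "logreg", "polyreg", "polr"))

# One method per variable; "" marks variables with nothing to impute
print(setup_default$method)
print(setup_norm$method)
```

Only the first entry of defaultMethod was changed, so only the numeric variables with missing values should switch method.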
The predictorMatrix governs which variables are used to impute other variables.
nhanes_multiimp <- mice(nhanes, m = 20)
nhanes_multiimp$predictorMatrix
           Age Gender Weight Height Diabetes TotChol Pulse PhysActive
Age          0      1      1      1        1       1     1          1
Gender       1      0      1      1        1       1     1          1
Weight       1      1      0      1        1       1     1          1
Height       1      1      1      0        1       1     1          1
Diabetes     1      1      1      1        0       1     1          1
TotChol      1      1      1      1        1       0     1          1
Pulse        1      1      1      1        1       1     0          1
PhysActive   1      1      1      1        1       1     1          0
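# Keep only predictors whose absolute correlation with the target
# (or with its missingness indicator) is at least 0.25
pred_mat <- quickpred(nhanes, mincor = 0.25)
nhanes_multiimp <- mice(nhanes, m = 20, predictorMatrix = pred_mat)
print(pred_mat)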
           Age Gender Weight Height Diabetes TotChol Pulse PhysActive
Age          0      0      0      0        0       0     0          0
Gender       0      0      0      0        0       0     0          0
Weight       1      1      0      0        0       0     1          0
...
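The predictor matrix can also be edited by hand: rows are the variables being imputed, columns are the predictors, and a 1 means the column variable is used to impute the row variable. The sketch below assumes mice's built-in nhanes example data (columns age, bmi, hyp, chl), not the nhanes data used in these slides. Dropping age as a predictor of bmi is an arbitrary choice for illustration.

```r
library(mice)

# Start from the default matrix (every variable predicts every other one)
pred_mat <- make.predictorMatrix(nhanes)

# Row = variable being imputed, column = predictor:
# stop using age when imputing bmi
pred_mat["bmi", "age"] <- 0

imp <- mice(nhanes, m = 5, predictorMatrix = pred_mat,
            printFlag = FALSE, seed = 123)
print(imp$predictorMatrix)
```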