Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
After finishing this course, you will be able to:
The course assumes you are comfortable with the following topics:
dplyr and the pipe operator (%>%).
lm(), glm()
Obviously the best way to treat missing data is not to have them.
Unfortunately, missing data are everywhere:
We have to stay watchful for missing data.
head(nhanes, 3)
  Age Gender Weight Height Diabetes TotChol Pulse PhysActive
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE
nhanes %>% is.na() %>% colSums()
       Age     Gender     Weight     Height   Diabetes    TotChol      Pulse PhysActive
         0          0          9          8          1         85         32         26
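The per-column count above can be complemented with the share of missing values and the number of fully observed rows. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins, not the `nhanes` data), using only base R:

```r
# Hypothetical toy data standing in for nhanes; values are invented for illustration
toy <- data.frame(
  Age     = c(16, 17, 12, 35, 41),
  Weight  = c(73.2, NA, 57.7, 80.1, NA),
  TotChol = c(3.00, 2.61, NA, NA, NA)
)

na_count <- colSums(is.na(toy))            # number of missing values per column
na_share <- colMeans(is.na(toy))           # fraction of missing values per column
complete_rows <- sum(complete.cases(toy))  # rows with no missing value in any column

print(na_count)
print(round(na_share, 2))
print(complete_rows)
```

The number of complete rows is worth checking alongside the per-column counts, because missing values in different columns can fall on different rows.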
model_1 <- lm(Diabetes ~ Age + Weight,
data = nhanes)
Parts of summary(model_1):
Residual standard error: 0.08571 on 804 degrees of freedom
  (10 observations deleted due to missingness)
Adjusted R-squared: 0.005706
F-statistic: 3.313 on 2 and 804 DF, p-value: 0.03691
model_2 <- lm(Diabetes ~ Age + Weight +
TotChol, data = nhanes)
Parts of summary(model_2):
Residual standard error: 0.08264 on 718 degrees of freedom
  (95 observations deleted due to missingness)
Adjusted R-squared: 0.008422
F-statistic: 3.041 on 3 and 718 DF, p-value: 0.02834
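The two summaries report different numbers of deleted observations (10 versus 95), so the models were fit to different subsets of rows and their fit statistics are not directly comparable. A minimal sketch of how this could be checked, assuming simulated data (the data frame `toy` and the variables `y`, `x1`, `x2` are hypothetical, not from `nhanes`):

```r
# Simulated data; names and values are illustrative assumptions only
set.seed(1)
n <- 100
toy <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
toy$x1[1:5]  <- NA   # 5 rows missing x1
toy$x2[6:25] <- NA   # 20 other rows missing x2

# lm() silently drops rows with missing values in any model variable
m1 <- lm(y ~ x1, data = toy)
m2 <- lm(y ~ x1 + x2, data = toy)

n1 <- nobs(m1)  # rows actually used by the smaller model
n2 <- nobs(m2)  # rows actually used by the larger model
print(c(n1, n2))

# One option: refit the smaller model on the rows both models can use
common <- toy[complete.cases(toy[, c("y", "x1", "x2")]), ]
m1_common <- lm(y ~ x1, data = common)
print(nobs(m1_common) == nobs(m2))
```

Refitting on the common rows makes the two models comparable, but it discards data; this is one reason to consider imputation instead.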