Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
Imputation = making an educated guess about what the missing values might be
This chapter focuses on donor-based methods:
Mean imputation works well for time-series data that randomly fluctuate around a long-term average.
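As a rough illustration of the time-series case, the sketch below simulates a series that fluctuates around a long-term average and mean-imputes the gaps. The series is white noise around 10, which is an assumption made for illustration and not data from this course, and all variable names are hypothetical.

```r
set.seed(42)

# Hypothetical series: random fluctuation around a long-term average of 10
y <- 10 + rnorm(200, sd = 1)

# Knock out 30 observations at random
y_obs <- y
y_obs[sample(200, 30)] <- NA
miss <- is.na(y_obs)

# Mean-impute the gaps
y_imp <- ifelse(miss, mean(y_obs, na.rm = TRUE), y_obs)

# Average absolute error of the imputed values against the true values
mae <- mean(abs(y_imp[miss] - y[miss]))
mae
```

Because every point of such a series is drawn around the same level, the observed mean is a reasonable guess for any gap, and the error should be comparable to the series' own noise.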
For cross-sectional data, mean imputation is often a very poor choice:
Task: mean-impute Height and Weight in NHANES data.
library(dplyr)  # provides %>%, mutate() and select()
# Flag which values were originally missing, so imputed values can be located later
nhanes <- nhanes %>%
  mutate(Height_imp = is.na(Height)) %>%
  mutate(Weight_imp = is.na(Weight))
Replace missing values in Height and Weight with their respective means.
nhanes_imp <- nhanes %>%
  mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height)) %>%
  mutate(Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
nhanes_imp %>%
select(Weight, Height, Height_imp, Weight_imp) %>%
head()
    Weight   Height Height_imp Weight_imp
1 73.20000 166.2499       TRUE      FALSE
2 72.30000 166.2499       TRUE      FALSE
3 57.70000 158.9000      FALSE      FALSE
4 88.90000 183.3000      FALSE      FALSE
5 45.10000 157.6000      FALSE      FALSE
6 66.77065 158.4000      FALSE       TRUE
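The flag-then-fill pattern used above can be wrapped in a small helper. This is a minimal sketch in base R on a toy data frame. The function name mean_impute and the toy values are hypothetical and not part of NHANES or any package.

```r
# Hypothetical helper: add a <col>_imp indicator, then mean-impute the column
mean_impute <- function(df, col) {
  was_missing <- is.na(df[[col]])
  # Record which values were missing before they are overwritten
  df[[paste0(col, "_imp")]] <- was_missing
  # The mean is computed from the observed values only
  df[[col]][was_missing] <- mean(df[[col]], na.rm = TRUE)
  df
}

# Toy data standing in for the real data set
toy <- data.frame(Height = c(170, NA, 160), Weight = c(NA, 80, 60))
toy_imp <- mean_impute(mean_impute(toy, "Height"), "Weight")
toy_imp
```

Building the indicator before overwriting matters: once the column has been filled, is.na() can no longer tell imputed values from observed ones.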
library(VIM)  # provides marginplot()
nhanes_imp %>% select(Weight, Height, Height_imp, Weight_imp) %>% marginplot(delimiter = "imp")
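Destroying relation between variables: after mean-imputing Height and Weight, their positive correlation is weaker.
No variability in imputed data: every imputed value equals the variable's mean, so the variance of the imputed variable is understated.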
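Both drawbacks can be checked numerically. The sketch below uses simulated height and weight values, not NHANES. The linear relation, the sample size, and the share of missing values are assumptions chosen for illustration.

```r
set.seed(1)
n <- 500

# Hypothetical, positively related variables
height <- rnorm(n, mean = 166, sd = 10)
weight <- -80 + 0.9 * height + rnorm(n, sd = 8)

# Remove 150 heights completely at random, then mean-impute them
height_obs <- height
height_obs[sample(n, 150)] <- NA
miss <- is.na(height_obs)
height_imp <- ifelse(miss, mean(height_obs, na.rm = TRUE), height_obs)

# Relation between variables: correlation before and after imputation
cor_complete <- cor(height_obs, weight, use = "complete.obs")
cor_imputed <- cor(height_imp, weight)

# Variability: spread before and after imputation
sd_observed <- sd(height_obs, na.rm = TRUE)
sd_imputed <- sd(height_imp)

# Number of distinct values among the imputed entries
n_distinct_imputed <- length(unique(height_imp[miss]))

c(cor_complete = cor_complete, cor_imputed = cor_imputed,
  sd_observed = sd_observed, sd_imputed = sd_imputed)
```

All imputed heights share a single value, so they add no spread of their own and sit on a flat line regardless of weight. That is what pulls both the standard deviation and the correlation down.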