Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
Missing data problems can be classified into three categories. Distinguishing between them is vital because each category requires a different solution.
Missing Completely at Random (MCAR): locations of missing values in the dataset are purely random; they do not depend on any other data.
Example:
A weather sensor is measuring temperature and sending the data to a database. There are some missing entries in the database for when the sensor broke down.
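The sensor-breakdown example can be mimicked with simulated data. The sketch below is only an illustration: the temperature values, the 10% breakdown rate, and the seed are all made-up assumptions, not data from the course.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated temperature readings (made-up data)
temperature <- rnorm(1000, mean = 10, sd = 8)

# Purely random missingness: every reading has the same 10% chance
# of being lost, whatever its value or any other variable
breakdown <- runif(1000) < 0.10
temperature_mcar <- ifelse(breakdown, NA, temperature)

# Share of missing readings
mean(is.na(temperature_mcar))

# Mean of all readings vs. mean of the readings that survived
mean(temperature)
mean(temperature_mcar, na.rm = TRUE)
```

Because the missingness ignores the data entirely, the mean of the surviving readings should stay close to the mean of all readings.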
Missing at Random (MAR): locations of missing values in the dataset depend on some other, observed data.
Example:
There are some missing temperature values in the database for when the sensor was switched off for maintenance. As the maintenance team never work on the weekends, the locations of missing values depend on the day of the week.
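A rough simulation of the maintenance example shows missingness that depends on an observed variable. The 52-week log, the 20% maintenance rate, and all variable names are assumptions made for this sketch.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Made-up sensor log: 52 weeks of daily readings
n_days <- 7 * 52
day_of_week <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                   times = 52)
temperature <- rnorm(n_days, mean = 10, sd = 8)

# Maintenance, and therefore missingness, happens on weekdays only;
# the 20% rate is an arbitrary choice
is_weekday <- !(day_of_week %in% c("Sat", "Sun"))
maintenance <- is_weekday & runif(n_days) < 0.20
temperature_mar <- ifelse(maintenance, NA, temperature)

# Share of missing values for each day of the week
tapply(is.na(temperature_mar), day_of_week, mean)
```

The share of missing values is zero on Saturdays and Sundays, so the missingness can be explained by the observed day of the week.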
Missing Not at Random (MNAR): locations of missing values in the dataset depend on the missing values themselves.
Example:
When it's extremely cold, the weather sensor freezes and stops working. So, it does not record very low temperatures. Thus, the locations of missing values in the temperature variable depend on the values of this variable themselves.
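The freezing-sensor example can be sketched the same way. The freezing threshold of -15 degrees and the simulated temperatures are hypothetical values chosen for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Simulated temperature readings (made-up data)
temperature <- rnorm(1000, mean = 0, sd = 10)

# The sensor stops recording below a hypothetical threshold, so
# missingness depends on the very value that goes missing
frozen <- temperature < -15
temperature_mnar <- ifelse(frozen, NA, temperature)

# The lowest recorded value never falls below the threshold
min(temperature_mnar, na.rm = TRUE)

# Mean of all readings vs. mean of the recorded readings
mean(temperature)
mean(temperature_mnar, na.rm = TRUE)
```

Since only the coldest readings are lost, the mean of the recorded values ends up higher than the mean of all values.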
What if we simply drop incomplete observations?
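One way to get a feel for the cost of dropping incomplete observations is to count how many rows survive. The sketch below uses a made-up data frame with values removed purely at random; the column names and the 15% missingness rate are assumptions.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

# Made-up weather data with two variables
weather <- data.frame(
  temperature = rnorm(500, mean = 10, sd = 8),
  humidity = runif(500, min = 30, max = 90)
)

# Remove 15% of the values in each column, purely at random
weather$temperature[sample(500, 75)] <- NA
weather$humidity[sample(500, 75)] <- NA

# Dropping incomplete observations (listwise deletion)
complete <- na.omit(weather)

# Rows before and after, and the share of rows lost
nrow(weather)
nrow(complete)
1 - nrow(complete) / nrow(weather)
```

A row is discarded if any of its values is missing, so the share of rows lost is at least as large as the share of missing values in any single column. When the missingness is not purely random, the remaining rows may also stop being representative of the full data.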
Example: t-test for difference in means
p-value small → reject the null hypothesis → means are different
p-value large → don't reject the null hypothesis → no evidence that the means differ
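The decision rule above can be tried on simulated data. In this sketch the two samples, their true means, and the 0.05 significance level are assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Two made-up samples whose true means differ by 1
group_a <- rnorm(200, mean = 0, sd = 1)
group_b <- rnorm(200, mean = 1, sd = 1)

# t.test() runs a Welch two-sample t-test by default
result <- t.test(group_a, group_b)
result$p.value

# Conventional significance level (an assumption, not a fixed rule)
alpha <- 0.05
if (result$p.value < alpha) {
  decision <- "reject the null hypothesis: the means differ"
} else {
  decision <- "do not reject the null hypothesis"
}
decision
```

With samples this large and a true difference of one standard deviation, the p-value is tiny and the null hypothesis of equal means is rejected.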
Goal: test if the percentage of missing values in one variable differs for different values of another variable.
Example: is the percentage of missing values in PhysActive different for males and females?
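Testing procedure: create an indicator variable denoting whether PhysActive is missing, then run a t-test comparing the mean of this indicator between males and females.
library(dplyr)

# Flag the rows where PhysActive is missing
nhanes <- nhanes %>%
  mutate(missing_phys_active = is.na(PhysActive))
# Missingness indicator for males
missing_phys_active_male <- nhanes %>%
  filter(Gender == "male") %>%
  pull(missing_phys_active)
# Missingness indicator for females
missing_phys_active_female <- nhanes %>%
  filter(Gender == "female") %>%
  pull(missing_phys_active)
# Compare the share of missing values between the two groups
t.test(missing_phys_active_female, missing_phys_active_male)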
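        Welch Two Sample t-test

data:  missing_phys_active_female and missing_phys_active_male
t = -1.7192, df = 781.18, p-value = 0.08597
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.044414688  0.002940477
sample estimates:
 mean of x  mean of y
0.02083333 0.04157044

About 2.1% of PhysActive values are missing for females and 4.2% for males. With a p-value of 0.086, the null hypothesis is not rejected at the 5% significance level: there is no significant evidence that the percentage of missing values differs between the two groups.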
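The same procedure can be wrapped in a small function and tried on simulated data. Everything in this sketch is hypothetical: the helper test_missingness(), the toy data frame standing in for nhanes, and the missingness rates of 2% and 15%. It uses base R only.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Simulated stand-in for the survey data (not the real nhanes)
toy <- data.frame(
  Gender = sample(c("male", "female"), 800, replace = TRUE),
  PhysActive = sample(c("Yes", "No"), 800, replace = TRUE)
)

# Make PhysActive go missing more often for males (15%) than
# for females (2%); both rates are made up
drop_prob <- ifelse(toy$Gender == "male", 0.15, 0.02)
toy$PhysActive[runif(800) < drop_prob] <- NA

# Hypothetical helper: t-test on the missingness indicator of
# `target`, comparing two levels of `group`
test_missingness <- function(data, target, group, level_a, level_b) {
  is_missing <- is.na(data[[target]])
  t.test(is_missing[data[[group]] == level_a],
         is_missing[data[[group]] == level_b])
}

result <- test_missingness(toy, "PhysActive", "Gender", "female", "male")
result$estimate  # share of missing values in each group
result$p.value
```

Here the missingness was built to depend on Gender, so a small p-value is expected. In real data, a small p-value would suggest that the missingness depends on the grouping variable and is therefore not purely random.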