Missing data mechanisms

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Missing Data Mechanisms: overview

Missing data problems can be classified into three categories. Distinguishing between them is vital because each category requires a different solution.

Missing Completely at Random (MCAR).
Missing at Random (MAR).
Missing Not at Random (MNAR).

Missing Completely at Random (MCAR)

Locations of missing values in the dataset are purely random: they do not depend on any other data, observed or unobserved.

Example:

A weather sensor is measuring temperature and sending the data to a database. There are some missing entries in the database for when the sensor broke down.
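To make the definition concrete, here is a minimal sketch that simulates MCAR in base R. The readings, the 10% failure rate, and all variable names are assumptions made up for illustration, not data from the course.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical temperature readings (simulated, not real sensor data)
n <- 1000
temperature <- rnorm(n, mean = 10, sd = 8)

# MCAR: every reading has the same 10% chance of being lost,
# regardless of its own value or of any other variable
is_lost <- runif(n) < 0.10
temperature_mcar <- ifelse(is_lost, NA, temperature)

# Share of missing readings; should be roughly 0.10
mean(is.na(temperature_mcar))
```

Because the chance of being lost is a constant, the observed readings remain a random subsample of all readings.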

Missing at Random (MAR)

Locations of missing values in the dataset depend on some other, observed data.

Example:

There are some missing temperature values in the database for when the sensor was switched off for maintenance. As the maintenance team never works on weekends, values can only go missing on weekdays, so the locations of missing values depend on the day of the week.
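The maintenance example can be simulated in the same spirit. This is a sketch under assumed numbers: a 15% chance of a missing reading on weekdays and none on weekends. The variable names are hypothetical.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 1000
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
day <- sample(days, n, replace = TRUE)
temperature <- rnorm(n, mean = 10, sd = 8)  # simulated readings

# MAR: the chance of a missing reading depends only on the observed
# day of the week (assumed: 15% on weekdays, 0% on weekends)
is_weekday <- !(day %in% c("Sat", "Sun"))
p_missing <- ifelse(is_weekday, 0.15, 0)
temperature_mar <- ifelse(runif(n) < p_missing, NA, temperature)

# Share of missing readings on weekends (FALSE) vs. weekdays (TRUE)
tapply(is.na(temperature_mar), is_weekday, mean)
```

The key point is that `day` is fully observed, so the missingness pattern can be explained by data we actually have.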

Missing Not at Random (MNAR)

Locations of missing values in the dataset depend on the missing values themselves.

Example:

When it's extremely cold, the weather sensor freezes and stops working. So, it does not record very low temperatures. Thus, the locations of missing values in the temperature variable depend on the values of this variable themselves.
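A comparable sketch for MNAR, assuming a hypothetical freezing threshold of -5 degrees below which the simulated sensor always fails:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 1000
temperature <- rnorm(n, mean = 10, sd = 8)  # simulated true temperatures

# MNAR: a reading is missing exactly when the true temperature is
# below the (assumed) freezing threshold of the sensor
freeze_threshold <- -5
temperature_mnar <- ifelse(temperature < freeze_threshold, NA, temperature)

# Number of readings lost to freezing
sum(is.na(temperature_mnar))

# The coldest recorded value is never below the threshold
min(temperature_mnar, na.rm = TRUE)
```

Nothing in the recorded data reveals why a value is missing, because the cause is the unrecorded value itself.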

Handling the mechanisms

What if we simply drop incomplete observations?

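If the data are MCAR, dropping incomplete observations loses information, but the remaining observations are still a random sample.
If the data are MAR or MNAR, dropping incomplete observations introduces bias into models built on these data.
In the MAR and MNAR cases, missing values should be imputed instead.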
Many imputation methods assume MAR, so it's important to detect it.
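The effect of dropping incomplete observations can be checked on simulated data. The sketch below compares the mean of the remaining values with the true mean under MCAR and under MNAR. The 20% loss rate and the cutoff at 0 are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 5000
temperature <- rnorm(n, mean = 10, sd = 8)  # simulated true values

# MCAR: 20% of values lost at random
temp_mcar <- ifelse(runif(n) < 0.20, NA, temperature)

# MNAR: every value below 0 is lost
temp_mnar <- ifelse(temperature < 0, NA, temperature)

# Dropping NAs (na.rm = TRUE) mimics removing incomplete observations
means <- c(
  true = mean(temperature),
  mcar = mean(temp_mcar, na.rm = TRUE),
  mnar = mean(temp_mnar, na.rm = TRUE)
)
round(means, 2)
```

Under MCAR the mean of the remaining values stays close to the true mean, while under MNAR it is shifted upward because only the cold readings were removed.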

Statistical testing

Example: t-test for difference in means

Make an assumption (null hypothesis): the means are equal.
Compute the test statistic from your data.
Compute the p-value: assuming the null hypothesis is true, how likely is it to obtain a test statistic at least as extreme as the one you got?

p-value small → reject the null hypothesis → means are different

p-value large → don't reject the null hypothesis → no evidence that the means are different
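The three steps can be sketched with base R's `t.test()` on two simulated groups. The group sizes, means, and the 0.05 threshold are assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Two hypothetical samples whose true means differ (10 vs. 12)
group_a <- rnorm(50, mean = 10, sd = 2)
group_b <- rnorm(50, mean = 12, sd = 2)

# Null hypothesis: the two means are equal.
# With two samples, t.test() runs a Welch two-sample t-test by default.
result <- t.test(group_a, group_b)

result$statistic  # the test statistic
result$p.value    # the p-value

# Decision at the conventional 5% significance level
if (result$p.value < 0.05) {
  message("Reject the null hypothesis: the means appear to differ.")
} else {
  message("Do not reject the null hypothesis: no evidence of a difference.")
}
```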

Testing for MAR

Goal: test if the percentage of missing values in one variable differs for different values of another variable.

Example: is the percentage of missing values in PhysActive different for males and females?

Testing procedure:

Create a dummy variable denoting whether PhysActive is missing.
Use a t-test to check if the mean of this dummy is different for males and females.
If the p-value is small (e.g. < 0.05), the means are different: missingness in PhysActive depends on Gender, which suggests the data are MAR.

Testing in practice

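library(dplyr)

# Dummy variable: TRUE where PhysActive is missing
nhanes <- nhanes %>% 
  mutate(missing_phys_active = is.na(PhysActive))

# Missingness indicator for males
missing_phys_active_male <- nhanes %>% 
  filter(Gender == "male") %>% 
  pull(missing_phys_active)

# Missingness indicator for females
missing_phys_active_female <- nhanes %>% 
  filter(Gender == "female") %>% 
  pull(missing_phys_active)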

Interpreting test results

t.test(missing_phys_active_female, missing_phys_active_male)

    Welch Two Sample t-test

data:  missing_phys_active_female and missing_phys_active_male
t = -1.7192, df = 781.18, p-value = 0.08597
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.044414688  0.002940477
sample estimates:
 mean of x  mean of y 
0.02083333 0.04157044
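In the output above, the p-value is 0.08597, which is above 0.05, so the null hypothesis of equal means is not rejected: this test gives no evidence that missingness in PhysActive depends on Gender. The same decision can be read from the test object in code. The sketch below uses a small simulated stand-in for the course data (the `toy` data frame and its values are made up) and the formula interface of `t.test()`.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the nhanes data (simulated values)
toy <- data.frame(
  Gender = sample(c("male", "female"), 400, replace = TRUE),
  PhysActive = sample(c("Yes", "No", NA), 400, replace = TRUE,
                      prob = c(0.50, 0.45, 0.05))
)

# Dummy variable: 1 where PhysActive is missing, 0 otherwise
toy$missing_phys_active <- as.numeric(is.na(toy$PhysActive))

# Compare the mean of the dummy between the two Gender groups
result <- t.test(missing_phys_active ~ Gender, data = toy)

result$estimate  # share of missing values in each group
result$conf.int  # 95% confidence interval for the difference
result$p.value   # compare against the chosen threshold, e.g. 0.05
```

A confidence interval that contains 0, as in the output above, leads to the same conclusion as a p-value above 0.05.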

Let's practice recognizing missing data mechanisms!
