Missing data mechanisms

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Missing Data Mechanisms: overview

Missing data problems can be classified into three categories. Distinguishing between them is vital because each category requires a different solution.

  • Missing Completely at Random (MCAR).
  • Missing at Random (MAR).
  • Missing Not at Random (MNAR).

Missing Completely at Random (MCAR)

Locations of missing values in the dataset are purely random; they do not depend on any other data, observed or missing.

Example:

A weather sensor is measuring temperature and sending the data to a database. There are some missing entries in the database for when the sensor broke down.
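To make the MCAR idea concrete, here is a minimal sketch on simulated data. The variable names, the 10% breakdown rate, and the temperature distribution are assumptions chosen for illustration only; they are not part of the course data.

```r
set.seed(42)

# Simulated temperature readings (hypothetical values, in degrees Celsius)
n <- 1000
temperature <- rnorm(n, mean = 10, sd = 8)

# MCAR: every reading has the same 10% chance of being lost,
# regardless of the temperature or of any other variable
broken <- runif(n) < 0.10
temperature_mcar <- ifelse(broken, NA, temperature)

# Share of missing readings (should be close to 0.10)
mean(is.na(temperature_mcar))
```

Because the breakdown indicator is drawn independently of everything else, the observed readings remain a random subsample of all readings.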


Missing at Random (MAR)

Locations of missing values in the dataset depend on some other, observed data.

Example:

There are some missing temperature values in the database for when the sensor was switched off for maintenance. As the maintenance team never work on the weekends, the locations of missing values depend on the day of the week.
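The maintenance example can be sketched in the same way. This is a simulation under stated assumptions: the `day` vector, the 20% maintenance probability, and the temperature distribution are hypothetical.

```r
set.seed(42)

# 100 weeks of simulated daily readings (hypothetical values)
n <- 700
day <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), length.out = n)
temperature <- rnorm(n, mean = 10, sd = 8)

# MAR: maintenance happens only on weekdays, so missingness
# depends on the observed variable `day`, not on the temperature
is_weekday <- !(day %in% c("Sat", "Sun"))
maintenance <- is_weekday & runif(n) < 0.20
temperature_mar <- ifelse(maintenance, NA, temperature)

# Share of missing readings on weekends (FALSE) vs. weekdays (TRUE)
tapply(is.na(temperature_mar), is_weekday, mean)
```

The weekend share is zero by construction, which is exactly the kind of difference a test for MAR looks for.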


Missing Not at Random (MNAR)

Locations of missing values in the dataset depend on the missing values themselves.

Example:

When it's extremely cold, the weather sensor freezes and stops working. So, it does not record very low temperatures. Thus, the locations of missing values in the temperature variable depend on the values of this variable themselves.
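A minimal sketch of the freezing sensor, assuming a hypothetical freezing threshold of -5 degrees and simulated temperatures:

```r
set.seed(42)

# Simulated temperature readings (hypothetical values, in degrees Celsius)
temperature <- rnorm(1000, mean = 10, sd = 8)

# MNAR: the sensor freezes below -5 degrees, so whether a value is
# missing depends on the (unrecorded) value itself
frozen <- temperature < -5
temperature_mnar <- ifelse(frozen, NA, temperature)

# The lowest recorded temperature never falls below the threshold
min(temperature_mnar, na.rm = TRUE)
```

Nothing in the recorded data reveals how cold the missing days were, which is what makes MNAR hard to detect from the data alone.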


Handling the mechanisms

What if we simply drop incomplete observations?

  • If the data are MCAR, removing incomplete observations results in an information loss, since fewer observations remain.
  • If the data are MAR or MNAR, removing incomplete observations introduces bias to models built on these data.
  • In the MAR and MNAR cases, missing values should be imputed rather than dropped.
  • Many imputation methods assume MAR, so it's important to detect it.
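The difference between losing information and introducing bias can be illustrated with a small simulation. This is a sketch under assumptions: the data, the 20% MCAR rate, and the MNAR cut-off at 0 degrees are all made up for illustration.

```r
set.seed(1)

# Simulated "true" temperatures (hypothetical values)
n <- 10000
temperature <- rnorm(n, mean = 10, sd = 8)

# MCAR: 20% of values dropped at random
mcar <- ifelse(runif(n) < 0.20, NA, temperature)

# MNAR: all values below 0 degrees are never recorded
mnar <- ifelse(temperature < 0, NA, temperature)

# Means after dropping incomplete observations
c(
  true = mean(temperature),
  mcar = mean(mcar, na.rm = TRUE),
  mnar = mean(mnar, na.rm = TRUE)
)
```

Under MCAR the mean of the remaining values stays close to the true mean, only estimated from fewer observations. Under MNAR the cold days are missing, so the mean of the remaining values is shifted upwards.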

Statistical testing

Example: t-test for difference in means

  1. Make an assumption (null hypothesis): the means are equal.
  2. Compute the test statistic from your data.
  3. Compute the p-value: how likely is it to obtain a test statistic at least as extreme as the one you got, assuming the null hypothesis is true?

p-value small → reject the null hypothesis → evidence that the means are different

p-value large → don't reject the null hypothesis → no evidence that the means are different
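The three steps can be tried out on simulated data. In this sketch the two groups and their true means (0 and 1) are assumptions made for illustration; `t.test()` is the base R function and runs Welch's two-sample test by default.

```r
set.seed(123)

# Two simulated groups whose true means differ by 1 (hypothetical data)
group_a <- rnorm(100, mean = 0)
group_b <- rnorm(100, mean = 1)

# Null hypothesis: the two means are equal
result <- t.test(group_a, group_b)

# Test statistic and p-value
result$statistic
result$p.value
```

A p-value below the chosen threshold (for example 0.05) leads to rejecting the null hypothesis of equal means.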


Testing for MAR

Goal: test if the percentage of missing values in one variable differs for different values of another variable.

Example: is the percentage of missing values in PhysActive different for males and females?

Testing procedure:

  1. Create a dummy variable denoting whether PhysActive is missing.
  2. Use a t-test to check if the mean of this dummy is different for males and females.
  3. If the p-value is small (e.g. < 0.05), the means are different: missingness in PhysActive depends on Gender, so the data are not MCAR, which is consistent with MAR.

Testing in practice

library(dplyr)

# Flag rows where PhysActive is missing
nhanes <- nhanes %>%
  mutate(missing_phys_active = is.na(PhysActive))

# Missingness flags for males
missing_phys_active_male <- nhanes %>%
  filter(Gender == "male") %>%
  pull(missing_phys_active)

# Missingness flags for females
missing_phys_active_female <- nhanes %>%
  filter(Gender == "female") %>%
  pull(missing_phys_active)

Interpreting test results

t.test(missing_phys_active_female, missing_phys_active_male)
    Welch Two Sample t-test

data:  missing_phys_active_female and missing_phys_active_male
t = -1.7192, df = 781.18, p-value = 0.08597
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.044414688  0.002940477
sample estimates:
 mean of x  mean of y 
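0.02083333 0.04157044

The p-value of 0.086 is above the 0.05 threshold, so we don't reject the null hypothesis: at the 5% level there is no evidence that the percentage of missing values in PhysActive differs between females (about 2.1%) and males (about 4.2%).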
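The same kind of check can also be written with the formula interface of `t.test()` and the p-value read off programmatically. The sketch below uses a simulated `toy` data frame, not the course's nhanes data; its missingness rates (2% for females, 10% for males) are assumptions chosen so that the dependence on Gender is strong.

```r
set.seed(2024)

# Hypothetical data: 1000 females and 1000 males
toy <- data.frame(
  Gender = rep(c("female", "male"), each = 1000),
  PhysActive = sample(c("Yes", "No"), 2000, replace = TRUE)
)

# Make PhysActive missing more often for males than for females
p_missing <- ifelse(toy$Gender == "male", 0.10, 0.02)
toy$PhysActive[runif(2000) < p_missing] <- NA

# Dummy variable: 1 if PhysActive is missing, 0 otherwise
toy$missing_phys_active <- as.numeric(is.na(toy$PhysActive))

# Welch t-test of the dummy's mean across the two Gender groups
result <- t.test(missing_phys_active ~ Gender, data = toy)
result$p.value

# TRUE means the difference is significant at the 5% level
result$p.value < 0.05
```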

Let's practice recognizing missing data mechanisms!
