Are the data missing at random?

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

Types of Missing Data

Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)

MCAR

Missing Completely at Random

There is no way to predict which values are missing
Can drop missing data
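
As a minimal sketch of what dropping missing data looks like in base R, assuming a small data frame (the names `df`, `x`, and `y` are hypothetical and not from the course data):

```r
# Hypothetical toy data frame; column names are illustrative only
df <- data.frame(
  x = c(1.2, NA, 3.4, 4.1, NA),
  y = c(10, 20, 30, NA, 50)
)

# Keep only the rows with no missing value in any column
df_complete <- na.omit(df)

# Same rows, selected with a logical index instead
df_complete2 <- df[complete.cases(df), ]

# Number of rows that survive the drop
nrow(df_complete)
```

Dropping rows this way is only defensible when MCAR is plausible; otherwise the remaining rows may no longer be representative.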

MAR

Missing at Random

Missingness depends on other, observed variables in the data set
Use multiple imputation to predict what missing values could be

MNAR

Missing Not at Random

Not MCAR or MAR
Missingness depends on the unobserved values themselves, e.g. a deterministic relationship between variables

Dealing with missing data in this course

Full treatment of missingness is beyond the scope of this course
We will check whether it is plausible that the data are MCAR and, if so, drop the missing values

A Quick Check for MAR

Recode a column as 1 where the value is missing and 0 otherwise
Regress this missingness indicator on the other variables using logistic regression
A significant p-value indicates MAR
Repeat for other columns with missingness
Some p-values can be significant by chance, so adjust your cutoff for significance based on the number of regressions
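
The first three steps can be sketched on data where the missingness is MAR by construction. Everything here is an assumption for illustration: the variables `age` and `income`, the seed, and the way missingness is generated are hypothetical and not part of the course data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: the chance that income is missing rises with age
n <- 500
age <- rnorm(n, mean = 40, sd = 10)
income <- rnorm(n, mean = 50, sd = 15)
p_miss <- plogis((age - 40) / 5)
income[runif(n) < p_miss] <- NA

# Step 1: recode the column as 1 if missing, 0 otherwise
income_missing <- as.integer(is.na(income))

# Step 2: logistic regression of the indicator on another variable
fit <- glm(income_missing ~ age, family = binomial)

# Step 3: p-value for the age coefficient
p_age <- summary(fit)$coefficients["age", "Pr(>|z|)"]
p_age
```

Because missingness was generated from `age`, the p-value should be very small here, which is the pattern the quick check is meant to flag.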

MAR Quick Check Example

# Our dependent variable
is_missing <- rbinom(1000, 1, 0.5)

# Our independent variables
data_matrix <- matrix(rnorm(1000*10), nrow = 1000, 
                      ncol = 10)

# A vector of p-values we'll fill in
p_vals <- rep(NA, ncol(data_matrix))

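# Perform logistic regression
for (j in seq_len(ncol(data_matrix))) {
  # Regress is_missing on column j; family belongs to glm(), not summary()
  s <- summary(glm(is_missing ~ data_matrix[, j],
                   family = binomial))
  # Row 2 is the slope for column j; column 4 holds its p-value
  p_vals[j] <- s$coefficients[2, 4]
}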

# Show the p-values
p_vals

0.5930082 0.7822695 0.7560343 0.3689330 0.8757048 
0.8812320 0.8281008 0.4888898 0.4781299 0.5655739

Let's practice!
