Are the data missing at random?

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

Types of Missing Data

  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)

MCAR

Missing Completely at Random

  • There is no way to predict which values are missing
  • Missing data can be dropped without biasing the results
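
As a minimal sketch of dropping missing data, assuming a small hypothetical data frame `df` (not part of the course data), base R's `complete.cases()` and `na.omit()` both keep only the rows with no missing values:

```r
# Hypothetical data frame with a few missing values (illustrative only)
df <- data.frame(
  x = c(1.2, NA, 3.4, 4.1, 5.0),
  y = c(10, 20, NA, 40, 50)
)

# Count the missing values in each column
colSums(is.na(df))

# Keep only the rows with no missing values
df_complete <- df[complete.cases(df), ]

# na.omit() keeps the same rows
df_omitted <- na.omit(df)
```

Dropping rows this way is only reasonable when it is plausible that the data are MCAR.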

MAR

Missing at Random

  • Missingness depends on observed variables in the data set
  • Use multiple imputation to predict what missing values could be

MNAR

Missing Not at Random

  • Not MCAR or MAR
  • Missingness depends on the missing values themselves, for example through a deterministic relationship between variables

Dealing with missing data in this course

  • Full treatment of missingness is beyond the scope of this course

  • We will check whether it is plausible that the data are MCAR and, if so, drop missing values

A Quick Check for MAR

  • Recode a column as 1 if the value is missing and 0 otherwise
  • Regress this indicator on the other variables using logistic regression
  • A significant p-value suggests the data are MAR rather than MCAR
  • Repeat for other columns with missingness
  • Some p-values will be significant by chance, so adjust your significance cutoff for the number of regressions
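
The steps above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `sim`, its columns `x1`, `x2`, and `y`, the missingness mechanism, and the 0.05 base cutoff are all hypothetical choices, and dividing the cutoff by the number of regressions is one simple way to adjust it.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))

# Make y more likely to be missing when x1 is large (a MAR mechanism)
sim$y[runif(n) < plogis(2 * sim$x1)] <- NA

# Step 1: recode y as 1 if missing, 0 otherwise
y_missing <- as.integer(is.na(sim$y))

# Step 2: logistic regression of the indicator on each other variable
predictors <- c("x1", "x2")
p_vals <- sapply(predictors, function(v) {
  fit <- glm(y_missing ~ sim[[v]], family = binomial)
  summary(fit)$coefficients[2, 4]  # p-value of the slope
})

# Step 3: compare against a cutoff adjusted for the number of regressions
cutoff <- 0.05 / length(predictors)
p_vals < cutoff
```

Because missingness in `y` was built to depend on `x1`, the p-value for `x1` should fall below the cutoff, which points to MAR rather than MCAR.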

MAR Quick Check Example

# Our dependent variable
is_missing <- rbinom(1000, 1, 0.5)

# Our independent variables
data_matrix <- matrix(rnorm(1000*10), nrow = 1000, ncol = 10)

# A vector of p-values we'll fill in
p_vals <- rep(NA, ncol(data_matrix))

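# Perform a logistic regression of the missingness indicator on each column
for (j in seq_len(ncol(data_matrix))) {
  # family = binomial belongs to glm(), not summary()
  s <- summary(glm(is_missing ~ data_matrix[, j],
                   family = binomial))
  # Row 2 is the slope; column 4 is its p-value
  p_vals[j] <- s$coefficients[2, 4]
}

# Show the p-values
p_vals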
0.5930082 0.7822695 0.7560343 0.3689330 0.8757048 
0.8812320 0.8281008 0.4888898 0.4781299 0.5655739
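
With ten regressions, the cutoff should be adjusted before judging these p-values. A minimal sketch, assuming a Bonferroni correction (one common choice; the slides do not name a method), a base cutoff of 0.05 (also an assumption), and the p-values above typed in by hand:

```r
# p-values copied from the output above
p_vals <- c(0.5930082, 0.7822695, 0.7560343, 0.3689330, 0.8757048,
            0.8812320, 0.8281008, 0.4888898, 0.4781299, 0.5655739)

alpha <- 0.05  # assumed base significance level

# Option 1: shrink the cutoff by the number of regressions
bonferroni_cutoff <- alpha / length(p_vals)
any(p_vals < bonferroni_cutoff)

# Option 2: inflate the p-values and keep the original cutoff
p_adjusted <- p.adjust(p_vals, method = "bonferroni")
any(p_adjusted < alpha)
```

Both checks come back `FALSE`: no column predicts the simulated missingness, as expected when `is_missing` is generated independently of the data.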

Let's practice!
