Congratulations!

Dealing With Missing Data in R

Nicholas Tierney

Statistician

Chapter 1

What missing values are

Missing values are values that should have been recorded but were not.

How to summarize missing values

miss_var_summary(airquality)
# A tibble: 6 x 3
  variable n_miss pct_miss 
  <chr>     <int>    <dbl> 
1 Ozone        37    24.2  
2 Solar.R       7     4.58 
3 Wind          0     0    
4 Temp          0     0    
5 Month         0     0    
6 Day           0     0
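
As a rough illustration of what the summary above reports, the same counts can be reproduced in base R. This is a minimal sketch, not naniar's implementation, and the helper name summarise_missing is hypothetical.

```r
# Hypothetical helper: count and percentage of NA values per column.
summarise_missing <- function(data) {
  n_miss <- vapply(data, function(x) sum(is.na(x)), integer(1))
  out <- data.frame(
    variable = names(data),
    n_miss = unname(n_miss),
    pct_miss = unname(100 * n_miss / nrow(data)),
    stringsAsFactors = FALSE
  )
  # Most-missing variables first, as in the output above
  out[order(-out$n_miss), ]
}

airquality_summary <- summarise_missing(airquality)
print(airquality_summary)
```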

vis_miss(airquality)

gg_miss_var(airquality, facet = Month)

Chapter 2

Find alternative missing values

miss_scan_count(data = pacman, 
                search = list("N/A"))

Replace alternative missing values

replace_with_na(pacman, 
                replace = list(
                  year = c("N/A"),
                  score = c("N/A")))
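
To make the search-and-replace steps concrete, here is a base R sketch of the same idea on a toy data frame. The pacman_toy data and its values are hypothetical stand-ins, and this approximates what the naniar calls above do rather than reproducing their implementation.

```r
# Toy stand-in for the pacman data (hypothetical values)
pacman_toy <- data.frame(
  year  = c("2018", "N/A", "2016"),
  score = c("N/A", "3500", "N/A"),
  stringsAsFactors = FALSE
)

# Count how often "N/A" appears in each column
n_alt_missing <- vapply(
  pacman_toy,
  function(x) sum(x == "N/A", na.rm = TRUE),
  integer(1)
)
print(n_alt_missing)

# Replace every "N/A" string with a real NA
pacman_clean <- pacman_toy
pacman_clean[pacman_clean == "N/A"] <- NA
print(pacman_clean)
```

Once the strings are real NA values, summaries such as miss_var_summary() can count them.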

Implicit missing values

frogger_tidy <- frogger %>% 
  complete(time, name)
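
The effect of complete() is easier to see on a small example. The sketch below assumes the tidyr package is installed and uses a hypothetical frogger_toy data frame in place of the course data.

```r
library(tidyr)

# Toy stand-in for the frogger data (hypothetical values):
# kaya has no "evening" row, so that value is implicitly missing.
frogger_toy <- data.frame(
  name  = c("jesse", "jesse", "kaya"),
  time  = c("morning", "evening", "morning"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# complete() adds a row for every time/name combination,
# filling the unobserved cells with NA
frogger_toy_tidy <- complete(frogger_toy, time, name)
print(frogger_toy_tidy)
```

The missing evening record for kaya now appears as an explicit row with an NA value, so it can be counted and visualized like any other missing value.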

Missing data dependence

  • MCAR: missing completely at random
  • MAR: missing at random
  • MNAR: missing not at random
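
A small simulation can make the three mechanisms concrete. This is an illustrative sketch on simulated data, not an example from the course. The variables x and y and the 20% missingness rate are assumptions chosen for the demonstration.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# MCAR: missingness in y is unrelated to any data values
y_mcar <- y
y_mcar[sample(n, 200)] <- NA

# MAR: missingness in y depends on the observed variable x
y_mar <- y
y_mar[x > quantile(x, 0.8)] <- NA

# MNAR: missingness in y depends on the unobserved value of y itself
y_mnar <- y
y_mnar[y > quantile(y, 0.8)] <- NA

# Compare the mean of the observed values with the full-data mean
means <- c(
  full = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar, na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE)
)
print(round(means, 2))
```

Under MCAR the observed mean should stay close to the full-data mean. In the MNAR case the largest values of y are the ones removed, so the observed mean is biased downward.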

Chapter 3

Shadow matrix and nabular data

nabular(airquality)
# A tibble: 153 x 12
  Ozone Solar.R  Wind  Temp
  <int>   <int> <dbl> <int>
1    41     190   7.4    67
2    36     118   8      72
3    12     149  12.6    74
# ... with 150 more rows, and 8
#   more variables: Month <int>, Day <int>,
#   Ozone_NA <fct>, Solar.R_NA <fct>,
#   Wind_NA <fct>, Temp_NA <fct>,
#   Month_NA <fct>, Day_NA <fct>
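
The structure of nabular data can be sketched in base R: each column gets a companion "_NA" factor recording whether the value was missing. This is an approximation for illustration, not naniar's implementation.

```r
# Shadow matrix: one factor per column, recording "NA" or "!NA"
shadow <- as.data.frame(lapply(airquality, function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(airquality), "_NA")

# Nabular data: the original columns next to their shadow columns
airquality_nabular <- cbind(airquality, shadow)
str(airquality_nabular[, c("Ozone", "Ozone_NA")])
```

Because the shadow columns are ordinary factors, they can be used in group_by() or as a ggplot2 aesthetic.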

Explore missingness, link summaries to data values

oceanbuoys %>%
  bind_shadow() %>%
  group_by(humidity_NA) %>%
  summarize(
    wind_ew_mean = mean(wind_ew))
 # A tibble: 2 x 2
   humidity_NA wind_ew_mean
   <fct>              <dbl>
 1 !NA                -3.78
 2 NA                 -3.30

How values change with missingness.

nabular(oceanbuoys) %>%
  ggplot(aes(x = wind_ew, 
             color = air_temp_c_NA)) + 
  geom_density()

[Figure: density of wind_ew, colored by whether air_temp_c is missing]

Visualize missing values across two variables.

ggplot(oceanbuoys,
       aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point()

[Figure: geom_miss_point() scatterplot of wind_ew against air_temp_c]

Chapter 4

Good and bad imputations

naniar::impute_mean_all()
simputation::impute_lm()

[Figure: mean imputation]

Compare imputed and original values

ggplot(ocean_imp_track, 
       aes(x = air_temp_c, 
           fill = air_temp_c_NA)) + 
  geom_histogram()

[Figure: histogram of air_temp_c, filled by whether each value was imputed]
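
Comparing imputed and original values relies on recording which values were missing before they are filled in. The slide does not show how ocean_imp_track was built, so the sketch below illustrates the idea in base R on airquality instead. The object names are hypothetical.

```r
# Record which Ozone values are missing before imputing
ozone_was_missing <- is.na(airquality$Ozone)

# Mean imputation: fill each NA with the mean of the observed values
airquality_imp <- airquality
airquality_imp$Ozone[ozone_was_missing] <- mean(airquality$Ozone, na.rm = TRUE)

# Keep the missingness label alongside the imputed data
airquality_imp$Ozone_NA <- factor(
  ifelse(ozone_was_missing, "NA", "!NA"),
  levels = c("!NA", "NA")
)

# How many values were imputed, and which values they received
print(table(airquality_imp$Ozone_NA))
print(unique(airquality_imp$Ozone[ozone_was_missing]))
```

All imputed values share a single number, the observed mean, which is why mean imputation shows up as one tall bar in a histogram filled by the shadow column.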

Using different imputation models

[Figure: comparing linear model imputation with mean imputation]

How imputation models affect subsequent inference

# A tibble: 12 x 6
   imp_model term  estimate
   <chr>     <chr>    <dbl>
 1 cc        (Int… -7.35e+2
 2 cc        air_…  8.64e-1
 3 cc        humi…  3.41e-2
 4 cc        year   3.69e-1
 5 imp_lm_w… (Int… -1.71e+3
 6 imp_lm_w… air_…  3.78e-1
# ... 6 more rows
# ... with 3 more variables:
#   std.error <dbl>,
#   statistic <dbl>,
#   p.value <dbl>
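
The kind of comparison summarized in the table above can be sketched in base R. The example below is an assumption-laden illustration: it uses airquality rather than the course data, compares a complete-case fit with a mean-imputed fit only, and the label imp_mean is hypothetical.

```r
# Complete-case model: lm() drops rows with a missing Ozone value
fit_cc <- lm(Ozone ~ Temp, data = airquality)

# Mean-imputed model: fill missing Ozone with the observed mean first
airquality_mean <- airquality
airquality_mean$Ozone[is.na(airquality_mean$Ozone)] <-
  mean(airquality$Ozone, na.rm = TRUE)
fit_mean <- lm(Ozone ~ Temp, data = airquality_mean)

# Stack the coefficient estimates, labelled by imputation model
comparison <- rbind(
  data.frame(imp_model = "cc", term = names(coef(fit_cc)),
             estimate = unname(coef(fit_cc))),
  data.frame(imp_model = "imp_mean", term = names(coef(fit_mean)),
             estimate = unname(coef(fit_mean)))
)
print(comparison)
```

In this sketch the Temp slope from the mean-imputed data is smaller in magnitude than the complete-case slope, because the imputed rows add spread in Temp without adding any variation in Ozone.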

This is only the beginning!

Thank you!
