Missing Data Workflows: The Shadow Matrix and Nabular Data

Dealing With Missing Data in R

Nicholas Tierney

Statistician

An example

Census data containing:

  • Income
  • Education
   income  education
 48.69087  NA
 40.93218  NA
 52.69245  high_school
 31.33808  NA
 89.35671  university
103.87278  university

What we are going to cover

The shadow matrix

Two main features

  1. Coordinated names
  2. Clear values
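
The two features can be sketched by hand in base R. This is only an illustration using the census example from the earlier slide; `label_missing` is a hypothetical helper introduced here, not a naniar function, and naniar's own implementation may differ.

```r
# Census example from the earlier slide
census <- data.frame(
  income = c(48.69087, 40.93218, 52.69245, 31.33808, 89.35671, 103.87278),
  education = c(NA, NA, "high_school", NA, "university", "university")
)

# Hypothetical helper (not part of naniar):
# clear values, each cell is labelled "NA" (missing) or "!NA" (not missing)
label_missing <- function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}

# Build the shadow matrix: same shape as the data, one label per cell
shadow <- as.data.frame(lapply(census, label_missing))

# Coordinated names: each shadow column is the original name plus "_NA"
names(shadow) <- paste0(names(census), "_NA")

shadow
```

Because the names are coordinated, the shadow column for `income` can always be found as `income_NA`, and the two labels read directly as "missing" and "not missing".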

Creating nabular data

   income  education    income_NA  education_NA
 48.69087  NA           !NA        NA
 40.93218  NA           !NA        NA
 52.69245  high_school  !NA        !NA
 31.33808  NA           !NA        NA
 89.35671  university   !NA        !NA
103.87278  university   !NA        !NA
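
A table like the one above could be produced with naniar's `bind_shadow()`, the same function used on `airquality` later in these slides. The sketch below assumes the naniar package is installed; the `census` data frame is typed in from the example slide, and the name `census_nab` is only illustrative.

```r
library(naniar)

# Census example from the earlier slide
census <- data.frame(
  income = c(48.69087, 40.93218, 52.69245, 31.33808, 89.35671, 103.87278),
  education = c(NA, NA, "high_school", NA, "university", "university")
)

# Bind the shadow matrix column-wise to the data to get nabular data
census_nab <- bind_shadow(census)

census_nab
```

The result keeps the original columns and adds one `_NA` column per variable, so the values and their missingness travel together in a single data frame.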

Using nabular data to perform summaries

library(naniar)
bind_shadow(airquality)
# A tibble: 153 x 12
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
 2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
 3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
 4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
 5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
 6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
 7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
 8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
 9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 2 more variables: Month_NA <fct>, Day_NA <fct>

Using nabular data to perform summaries

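library(dplyr)
library(naniar)

# Compare mean wind speed for rows where Ozone is missing vs. recorded
airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarize(mean = mean(Wind))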
Ozone_NA  mean
!NA        9.862069
NA        10.256757
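
As a cross-check, the same grouped mean can be sketched in base R without naniar or dplyr. The variable names `ozone_na` and `wind_means` are illustrative only; the labels mirror the "!NA" / "NA" values used in the shadow columns.

```r
# Label each row by whether Ozone is missing, mirroring the Ozone_NA column
ozone_na <- factor(
  ifelse(is.na(airquality$Ozone), "NA", "!NA"),
  levels = c("!NA", "NA")
)

# Mean wind speed within each missingness group
wind_means <- tapply(airquality$Wind, ozone_na, mean)

wind_means
```

This is the comparison that the nabular workflow makes convenient: a summary of one variable (`Wind`) split by the missingness of another (`Ozone`).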

Let's practice!
