How to summarise missing values

Dealing With Missing Data in R

Nicholas Tierney

Statistician

Introduction to missingness summaries

Basic summaries of missingness:

  • n_miss
  • n_complete

Dataframe summaries of missingness:

  • miss_var_summary
  • miss_case_summary

These functions work with group_by

Dealing With Missing Data in R

Missing data summaries: Variables

miss_var_summary(airquality)
# A tibble: 6 x 3
  variable n_miss pct_miss 
  <chr>     <int>    <dbl> 
1 Ozone        37    24.2  
2 Solar.R       7     4.58 
3 Wind          0     0    
4 Temp          0     0    
5 Month         0     0    
6 Day           0     0
Dealing With Missing Data in R

Missing data summaries: Cases

miss_case_summary(airquality)
# A tibble: 153 x 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ... with 143 more rows
Dealing With Missing Data in R

Missing data tabulations

miss_var_table(airquality)
# A tibble: 3 x 3
  n_miss_in_var n_vars pct_var
          <int>  <int>    <dbl>
1             0      4     66.7
2             7      1     16.7
3            37      1     16.7
miss_case_table(airquality)
# A tibble: 3 x 3
  n_miss_in_case n_cases pct_case
           <int>   <int>    <dbl>
1              0     111    72.5 
2              1      40    26.1 
3              2       2     1.31
Dealing With Missing Data in R

Missing data summaries: Spans of missing data

miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# A tibble: 10 x 5
   span_counter n_miss n_complete prop_miss prop_complete
          <int>  <int>      <dbl>     <dbl>         <dbl>
 1            1      0       4000   0               1    
 2            2      1       3999   0.00025         1.000
 3            3    121       3879   0.0302          0.970
 4            4    503       3497   0.126           0.874
 5            5    745       3255   0.186           0.814
 6            6      0       4000   0               1    
 7            7      1       3999   0.00025         1.000
 8            8      0       4000   0               1    
 9            9    745       3255   0.186           0.814
10           10    432       3568   0.108           0.892
Dealing With Missing Data in R

Missing data summaries: Runs of missing data

miss_var_run(pedestrian, hourly_counts)
# A tibble: 35 x 2
   run_length is_na   
        <int> <chr>   
 1       6628 complete
 2          1 missing 
 3       5250 complete
 4        624 missing 
 5       3652 complete
 6          1 missing 
 7       1290 complete
 8        744 missing 
 9       7420 complete
10          1 missing 
# ... with 25 more rows
Dealing With Missing Data in R

Using summaries with group_by

airquality %>%
  group_by(Month) %>%
  miss_var_summary()
# A tibble: 25 x 4
   Month variable n_miss pct_miss 
   <int> <chr>     <int>    <dbl> 
 1     5 Ozone         5     16.1 
 2     5 Solar.R       4     12.9 
 3     5 Wind          0      0   
 4     5 Temp          0      0   
 5     5 Day           0      0   
 6     6 Ozone        21     70   
 7     6 Solar.R       0      0   
# ... with 18 more rows
Dealing With Missing Data in R

Let's practice!

Dealing With Missing Data in R

Preparing Video For Download...