Completeness

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What is missing data?

A completed puzzle with a missing piece.

Occurs when no data value is stored for a variable in an observation.

Can be represented as NA, NaN, 0, 99, ., ...
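Because placeholder codes such as 99 or "." are not recognized by R as missing, they usually need to be recoded to NA before functions like is.na() will find them. A minimal base R sketch, assuming a hypothetical survey data frame (the column names and values are invented for illustration):

```r
# Hypothetical survey data where 99 and "." were used as missing-value codes
survey <- data.frame(
  age    = c(34, 99, 27, 99, 51),
  income = c("52000", ".", "61000", "48000", "."),
  stringsAsFactors = FALSE
)

# Recode the placeholder values as NA
survey$age[survey$age == 99] <- NA
survey$income[survey$income == "."] <- NA

# income was read as text because of the "." entries; convert it to numeric
survey$income <- as.numeric(survey$income)

# Missing values per column
colSums(is.na(survey))
```

Whether a value like 0 or 99 really means "missing" depends on the dataset, so this kind of recoding should follow the data documentation rather than a guess.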

A robot representing technical errors.

A figure of a person representing human error.

Air quality

head(airquality)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6

Finding missing values

is.na(airquality)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE

Counting missing values

# Count missing vals in entire dataset
sum(is.na(airquality))
44
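A single total does not show which columns are affected. A short base R sketch that breaks the count down by column and by row, using the built-in airquality data (a modified copy of the dataset may give different numbers):

```r
# Missing values per column (built-in airquality data)
na_counts <- colSums(is.na(airquality))
na_counts

# Share of missing values per column, as a percentage
na_pct <- round(100 * colMeans(is.na(airquality)), 1)
na_pct

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(airquality))
n_complete
```

colSums() and colMeans() work here because is.na() returns a logical matrix, and TRUE counts as 1.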

Visualizing missing values

library(visdat)
vis_miss(airquality)

Visualization created from code. Leftmost column has the most black lines, and the second column has the next most lines. None of the other columns have black lines.

Investigating missingness

library(dplyr)

# Compare column medians for rows with and without an Ozone value
airquality %>%
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
  miss_ozone Ozone Solar.R  Wind  Temp Month   Day
  <lgl>      <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE       31.5     207   9.7    65     7    16
2 TRUE        NA       194   9.7    99     6    15

Investigating missingness

airquality %>%
  arrange(Temp) %>%
  vis_miss()

Same visualization as before, but in the left column representing Ozone, all the missing values are at the bottom.
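The same question can be checked numerically instead of visually. A rough base R sketch that bins Temp into quartiles and computes the share of missing Ozone values in each bin; it uses the built-in airquality data, so the pattern may be weaker than in the plot described above:

```r
# Bin temperature into quartiles
temp_bin <- cut(
  airquality$Temp,
  breaks = quantile(airquality$Temp, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE
)

# Share of missing Ozone values within each temperature bin
miss_by_temp <- tapply(is.na(airquality$Ozone), temp_bin, mean)
round(miss_by_temp, 2)
```

If missingness were unrelated to temperature, the shares would be roughly equal across bins. A share that rises steadily with temperature would suggest the values are not missing completely at random.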

Types of missing data

Left: A six-sided die to represent missing completely at random. Middle: A six-sided die with only one dot on each side to represent missing at random. Right: Four squares pointing to each other in a cycle to represent missing not at random.

Types of missing data

Missing completely at random: No systematic relationship between missing data and other values. Example: Data entry errors when inputting data.

Types of missing data

Missing at random: Systematic relationship between missing data and other observed values. Example: Missing ozone data for high temperatures.

Types of missing data

Missing not at random: Systematic relationship between missing data and unobserved values. Example: Missing temperature values for high temperatures.
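The three types can be made concrete with a small simulation. This is only an illustrative sketch in base R: the variables, the 85-degree cutoff, and the missingness rates are all invented assumptions, not part of the course data.

```r
set.seed(42)
n <- 1000

# Simulated temperature, and ozone that depends on it
temp  <- rnorm(n, mean = 75, sd = 10)
ozone <- 20 + 0.5 * temp + rnorm(n, sd = 5)

# MCAR: every ozone value has the same 20% chance of being missing
ozone_mcar <- ifelse(runif(n) < 0.2, NA, ozone)

# MAR: ozone is missing mostly when the observed temperature is high
ozone_mar <- ifelse(temp > 85 & runif(n) < 0.8, NA, ozone)

# MNAR: temperature is missing whenever temperature itself is high
temp_mnar <- ifelse(temp > 85, NA, temp)

# Mean temperature for observed (FALSE) vs. missing (TRUE) ozone
tapply(temp, is.na(ozone_mcar), mean)
tapply(temp, is.na(ozone_mar), mean)
```

Under MCAR the two group means should be close, while under MAR the rows with missing ozone have clearly higher temperatures. The MNAR case cannot be diagnosed this way from the data alone, because the values that explain the missingness are exactly the ones that were never recorded.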

Dealing with missing data

Simple approaches:

  1. Drop missing data
  2. Impute (fill in) with statistical measures (mean, median, mode, ...) or domain knowledge

More complex approaches:

  1. Impute using an algorithmic approach
  2. Impute with machine learning models

 

Learn more in Dealing with Missing Data in R

Dropping missing values

airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118     8    72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    23     299   8.6    65     5     7
 6    19      99  13.8    59     5     8
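For comparison, the same rows can be dropped without dplyr. A base R sketch using complete.cases() on the built-in airquality data (the object names required, keep, and airquality_complete are chosen here for illustration):

```r
# Columns that must be present for a row to be kept
required <- c("Ozone", "Solar.R")

# complete.cases() is TRUE for rows with no NA in the selected columns
keep <- complete.cases(airquality[, required])
airquality_complete <- airquality[keep, ]

# How many rows were kept and how many were dropped
nrow(airquality_complete)
sum(!keep)
```

Checking how many rows are dropped is worthwhile before committing to this approach, since removing a large share of the data can bias later results if the values are not missing completely at random.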

Replacing missing values

airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
   Ozone Solar.R  Wind  Temp Month   Day ozone_filled
   <int>   <int> <dbl> <int> <int> <int>        <dbl>
 1    41     190   7.4    67     5     1         41  
 2    36     118   8      72     5     2         36  
 3    12     149  12.6    74     5     3         12  
 4    18     313  11.5    62     5     4         18  
 5    NA      NA  14.3    56     5     5         42.1
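A variation on the same idea, sketched in base R: fill with the median instead of the mean, and keep a flag column that records which values were imputed. The median is less sensitive to extreme values; the column names ozone_was_missing and ozone_filled are illustrative choices.

```r
# Work on a copy so the original data stays untouched
aq <- airquality

# Record which rows had no Ozone value before filling them in
aq$ozone_was_missing <- is.na(aq$Ozone)

# Replace missing Ozone values with the median of the observed values
ozone_median <- median(aq$Ozone, na.rm = TRUE)
aq$ozone_filled <- ifelse(aq$ozone_was_missing, ozone_median, aq$Ozone)

head(aq[, c("Ozone", "ozone_was_missing", "ozone_filled")])
```

Keeping the flag makes it possible to tell imputed values from measured ones later in the analysis.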

Let's practice!
