Completeness

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What is missing data?

Occurs when no data value is stored for a variable in an observation.

Can be represented as NA, nan, 0, 99, ., ...
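
Because placeholder values such as 99 or "." are not recognized as missing by R, they usually need to be recoded as NA first. The following is a minimal base R sketch; the `readings` data frame, its columns, and the choice of 99 and "." as placeholders are hypothetical and only serve as an illustration.

```r
# Hypothetical data: 99 and "." stand in for missing values
readings <- data.frame(
  id    = 1:5,
  score = c(12, 99, 7, 99, 3),
  grade = c("A", ".", "B", "C", "."),
  stringsAsFactors = FALSE
)

# Recode the placeholder values as NA so R treats them as missing
readings$score[readings$score == 99] <- NA
readings$grade[readings$grade == "."] <- NA

# Count the values that are now recognized as missing
sum(is.na(readings))
```

This only works when the placeholder cannot also be a legitimate value, which is something to confirm with domain knowledge before recoding.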

Common causes: technical errors and human error.

Air quality

head(airquality)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6

Finding missing values

is.na(airquality)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE

Counting missing values

# Count missing values in the entire dataset
sum(is.na(airquality))
44
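
A single total does not show where the missing values are. As a small sketch using only base R and the built-in `airquality` dataset, the same `is.na()` matrix can be summed column by column; the names `na_counts` and `na_share` are illustrative.

```r
# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Proportion of missing values in each column
na_share <- colMeans(is.na(airquality))
round(na_share, 3)
```

Summing `na_counts` gives back the overall total from `sum(is.na(airquality))`.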

Visualizing missing values

library(visdat)
vis_miss(airquality)

Visualization created from code. The leftmost column (Ozone) has the most black lines, which mark missing values, and the second column (Solar.R) has the next most. None of the other columns have black lines.

Investigating missingness

library(dplyr)

# Compare the median of each variable for rows with and without Ozone
airquality %>%
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
  miss_ozone Ozone Solar.R  Wind  Temp Month   Day
  <lgl>      <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE       31.5     207   9.7    65     7    16
2 TRUE        NA       194   9.7    99     6    15

airquality %>%
  arrange(Temp) %>%
  vis_miss()

Same visualization as before, but with rows sorted by Temp: in the leftmost column (Ozone), all the missing values are at the bottom, where temperatures are highest.

Types of missing data

Left: A six sided die to represent Missing Completely at Random. Middle: A six sided die with only one dot on each side to represent missing at random. Right: Four squares pointing to each other in a cycle to represent missing not at random.

Missing completely at random (MCAR): No systematic relationship between missing data and other values. Example: Data entry errors when inputting data.

Missing at random (MAR): Systematic relationship between missing data and other observed values. Example: Missing ozone data for high temperatures.

Missing not at random (MNAR): Systematic relationship between missing data and unobserved values. Example: Missing temperature values for high temperatures.
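
To make the three types concrete, the sketch below simulates each one on the complete `Temp` and `Wind` columns of the built-in `airquality` dataset. It is an illustration only: the 20% missingness rate, the cutoff of 85 degrees, the seed, and the variable names are all arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

temp <- airquality$Temp
wind <- airquality$Wind

# Missing completely at random: every Wind value has the same 20% chance
# of being removed, regardless of any other value
wind_mcar <- ifelse(runif(length(wind)) < 0.2, NA, wind)

# Missing at random: Wind is removed on hot days, so missingness depends
# on another observed variable (Temp)
wind_mar <- ifelse(temp > 85, NA, wind)

# Missing not at random: Temp is removed when Temp itself is high, so
# missingness depends on the value that is no longer observed
temp_mnar <- ifelse(temp > 85, NA, temp)

# Median Temp for rows where Wind is present (FALSE) vs. missing (TRUE)
tapply(temp, is.na(wind_mar), median)
```

In the missing at random case, grouping by missingness shows a clear difference in `Temp`, which is the same kind of check as the `miss_ozone` summary. In the missing not at random case, no observed column can reveal the pattern, because the values that explain it are the ones that are gone.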

Dealing with missing data

Simple approaches:

  1. Drop missing data
  2. Impute (fill in) with statistical measures (mean, median, mode, ...) or domain knowledge

More complex approaches:

  1. Impute using an algorithmic approach
  2. Impute with machine learning models

Learn more in Dealing with Missing Data in R

Dropping missing values

airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118     8    72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    23     299   8.6    65     5     7
 6    19      99  13.8    59     5     8
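
For comparison, here is a base R sketch of the same row filter using `complete.cases()` on the built-in `airquality` dataset. The names `keep` and `airquality_complete` are illustrative, and this is an alternative to the `filter()` call, not a replacement for it.

```r
# TRUE for rows where both Ozone and Solar.R are observed
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])

# Keep only those rows
airquality_complete <- airquality[keep, ]

# Compare row counts before and after dropping
nrow(airquality)
nrow(airquality_complete)
```

Comparing the row counts is a quick way to see how much data is lost by dropping, which matters when the missing values are not missing completely at random.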

Replacing missing values

airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
   Ozone Solar.R  Wind  Temp Month   Day ozone_filled
   <int>   <int> <dbl> <int> <int> <int>        <dbl>
 1    41     190   7.4    67     5     1         41  
 2    36     118   8      72     5     2         36  
 3    12     149  12.6    74     5     3         12  
 4    18     313  11.5    62     5     4         18  
 5    NA      NA  14.3    56     5     5         42.1
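
The same pattern works with other statistical measures. The base R sketch below fills with the median instead of the mean and also records which rows were imputed; the `ozone_was_missing` flag column and the name `airquality_filled` are assumptions added for illustration, not part of the original example.

```r
# Median of the observed Ozone values
ozone_median <- median(airquality$Ozone, na.rm = TRUE)

airquality_filled <- airquality

# Flag imputed rows so they can still be identified later
airquality_filled$ozone_was_missing <- is.na(airquality$Ozone)

# Replace missing Ozone values with the median, keep observed values as-is
airquality_filled$ozone_filled <- ifelse(
  is.na(airquality$Ozone),
  ozone_median,
  airquality$Ozone
)

head(airquality_filled)
```

The median is less sensitive to extreme values than the mean, so it can be a safer default for skewed variables. Keeping a flag column makes it possible to check later whether the imputed rows behave differently.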

Let's practice!
