Completeness

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What is missing data?

A completed puzzle with a missing piece.

Missing data occurs when no data value is stored for a variable in an observation.

Can be represented as NA, NaN, 0, 99, ., ...
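
Placeholder codes such as 99 or . are easy to mistake for real values, so it can help to convert them to NA before counting anything. A minimal sketch, assuming a hypothetical survey data frame in which 99 and "." were used as missing codes (neither the data nor the codes come from the course):

```r
library(dplyr)

# Hypothetical survey data: 99 and "." are placeholder codes, not real values
survey <- data.frame(
  age    = c(34, 99, 51, 28),
  income = c("52000", ".", "61000", "."),
  stringsAsFactors = FALSE
)

survey_clean <- survey %>%
  mutate(
    age    = na_if(age, 99),                 # 99 is the assumed missing code
    income = as.numeric(na_if(income, "."))  # "." is the assumed missing code
  )

# Count the values that are now recognized as missing
sum(is.na(survey_clean))
```

Whether a 0 or a 99 is a placeholder or a genuine measurement depends on the variable, so the codes to convert usually have to come from the data documentation.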

A robot representing technical errors.

A figure of a person representing human error.

Air quality

head(airquality)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6

Finding missing values

is.na(airquality)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE

Counting missing values

# Count missing vals in entire dataset
sum(is.na(airquality))
[1] 44
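
The total alone does not show where the gaps are. A small base R sketch, assuming the built-in airquality dataset, that breaks the count down by column and counts the fully observed rows (the names na_counts, na_percent, and n_complete are illustrative):

```r
# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Share of missing values in each column, as a percentage
na_percent <- round(100 * colMeans(is.na(airquality)), 1)
na_percent

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(airquality))
n_complete
```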

Visualizing missing values

library(visdat)
vis_miss(airquality)

Visualization created from code. Leftmost column has the most black lines, and the second column has the next most lines. None of the other columns have black lines.

Investigating missingness

library(dplyr)
airquality %>%
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
  miss_ozone Ozone Solar.R  Wind  Temp Month   Day
  <lgl>      <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE       31.5     207   9.7    65     7    16
2 TRUE        NA       194   9.7    99     6    15

Investigating missingness

airquality %>%
  arrange(Temp) %>%
  vis_miss()

Same visualization as before, but in the left column representing Ozone, all the missing values are at the bottom.

Types of missing data

Left: A six sided die to represent Missing Completely at Random. Middle: A six sided die with only one dot on each side to represent missing at random. Right: Four squares pointing to each other in a cycle to represent missing not at random.

Missing completely at random (MCAR): No systematic relationship between missing data and other values. Example: Data entry errors when inputting data.

Missing at random (MAR): Systematic relationship between missing data and other observed values. Example: Missing ozone data for high temperatures.

Missing not at random (MNAR): Systematic relationship between missing data and unobserved values. Example: Missing temperature values for high temperatures.
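
The three types can be made concrete with a small simulation. This is a sketch on synthetic data, not the course's dataset: the variables temp and ozone, the 20% missing rate, and the cutoff of 85 are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic data: ozone loosely depends on temperature
n     <- 1000
temp  <- rnorm(n, mean = 75, sd = 10)
ozone <- 0.5 * temp + rnorm(n, sd = 5)

# Missing completely at random: every ozone value has the same
# 20% chance of being missing, regardless of any other value
ozone_mcar <- ifelse(runif(n) < 0.2, NA, ozone)

# Missing at random: ozone is missing whenever the observed
# temperature is high
ozone_mar <- ifelse(temp > 85, NA, ozone)

# Missing not at random: temperature is missing whenever the
# temperature itself (the value we fail to observe) is high
temp_mnar <- ifelse(temp > 85, NA, temp)

# Compare mean temperature for rows with and without missing ozone
tapply(temp, is.na(ozone_mcar), mean)
tapply(temp, is.na(ozone_mar), mean)
```

Under the first mechanism the two group means should be close, while under the second the rows with missing ozone are by construction the hot ones. In the third case the pattern cannot be checked from the data alone, because the values that explain the missingness are the ones that were never recorded.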

Dealing with missing data

Simple approaches:

  1. Drop missing data
  2. Impute (fill in) with statistical measures (mean, median, mode, ...) or domain knowledge

More complex approaches:

  1. Impute using an algorithmic approach
  2. Impute with machine learning models

 

Learn more in Dealing with Missing Data in R

Dropping missing values

airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118     8    72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    23     299   8.6    65     5     7
 6    19      99  13.8    59     5     8
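
Before dropping rows it is worth checking how much data the drop costs. A base R sketch of the same filter, assuming the built-in airquality dataset (airquality_complete, n_dropped, and share_dropped are illustrative names):

```r
# Keep rows where both Ozone and Solar.R are observed
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])
airquality_complete <- airquality[keep, ]

# How many rows were dropped, and what share of the data that is
n_dropped     <- nrow(airquality) - nrow(airquality_complete)
share_dropped <- n_dropped / nrow(airquality)
n_dropped
share_dropped
```

If the share is large, or the missingness is related to another variable, dropping rows can bias later summaries, which is when imputation becomes the more attractive option.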

Replacing missing values

airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
   Ozone Solar.R  Wind  Temp Month   Day ozone_filled
   <int>   <int> <dbl> <int> <int> <int>        <dbl>
 1    41     190   7.4    67     5     1         41  
 2    36     118   8      72     5     2         36  
 3    12     149  12.6    74     5     3         12  
 4    18     313  11.5    62     5     4         18  
 5    NA      NA  14.3    56     5     5         42.1
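
The mean is only one of the statistical measures listed earlier. As a hedged variation, the sketch below fills with the median instead and keeps a flag recording which values were imputed. The column names ozone_was_missing and ozone_filled are illustrative, and the built-in airquality dataset is assumed.

```r
library(dplyr)

airquality_imputed <- airquality %>%
  mutate(
    # Record which rows were originally missing before filling them in
    ozone_was_missing = is.na(Ozone),
    # if_else() needs both branches to share a type, so Ozone is
    # converted from integer to double to match the median
    ozone_filled = if_else(
      is.na(Ozone),
      median(Ozone, na.rm = TRUE),
      as.numeric(Ozone)
    )
  )

# Confirm that no missing values remain in the filled column
sum(is.na(airquality_imputed$ozone_filled))
```

Keeping the flag column makes it possible to tell imputed values from measured ones later, for example when checking whether the imputation changed a result.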

Let's practice!