Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp

Missing data can be represented as NA, NaN, 0, 99, ., ...
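R does not treat placeholder codes such as 99 or "." as missing until they are converted to NA. A minimal sketch in base R, assuming a hypothetical `readings` vector in which 99 is known to mean "not recorded" (both the vector and that meaning are made up for illustration):

```r
# Hypothetical data: 99 is assumed to be a placeholder for "not recorded"
readings <- c(41, 36, 99, 18, 99, 28)

# Recode the placeholder as NA so R treats it as missing
readings[readings == 99] <- NA

# Flag and count the missing entries
is.na(readings)
sum(is.na(readings))
```

Recode only when the placeholder's meaning is documented, since 0 or 99 can also be a real measurement.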


head(airquality)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
is.na(airquality)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE
# Count missing values in the entire dataset
sum(is.na(airquality))
[1] 44
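The overall total does not show which columns the missing values come from. A short sketch of a per-column breakdown using base R's `colSums()` and `colMeans()` on the same logical matrix (the name `missing_per_column` is only illustrative):

```r
# Count missing values in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Proportion of rows that are missing, per column
colMeans(is.na(airquality))
```

In the built-in `airquality` data, only `Ozone` and `Solar.R` contain missing values.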
library(visdat)
vis_miss(airquality)

library(dplyr)
airquality %>%
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
  miss_ozone Ozone Solar.R  Wind  Temp Month   Day
  <lgl>      <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE       31.5     207   9.7    65     7    16
2 TRUE        NA       194   9.7    99     6    15
airquality %>%
  arrange(Temp) %>%
  vis_miss()





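Simple approaches: drop rows with missing values, or impute (fill in) missing values with a statistic such as the mean
More complex approaches: learn more in Dealing with Missing Data in R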
airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118     8    72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    23     299   8.6    65     5     7
 6    19      99  13.8    59     5     8
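Dropping rows shrinks the dataset, so it can help to check how many rows are lost. A sketch of the same filter in base R using `complete.cases()`, with before-and-after row counts (the names `keep` and `airquality_complete` are illustrative):

```r
# Keep rows where both Ozone and Solar.R are present (base R)
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])
airquality_complete <- airquality[keep, ]

# Compare row counts before and after dropping
nrow(airquality)
nrow(airquality_complete)
```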
airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
   Ozone Solar.R  Wind  Temp Month   Day ozone_filled
   <int>   <int> <dbl> <int> <int> <int>        <dbl>
 1    41     190   7.4    67     5     1         41  
 2    36     118   8      72     5     2         36  
 3    12     149  12.6    74     5     3         12  
 4    18     313  11.5    62     5     4         18  
 5    NA      NA  14.3    56     5     5         42.1
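After imputing, it is worth confirming that no missing values remain and seeing how the fill changed the column. A base R sketch of that check, assuming the same mean fill as above stored in a standalone vector `ozone_filled`:

```r
# Same mean fill as above, written in base R
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_filled <- ifelse(is.na(airquality$Ozone), ozone_mean, airquality$Ozone)

# No missing values should remain after the fill
sum(is.na(ozone_filled))

# The mean is unchanged, but the spread shrinks because
# every filled value sits exactly at the mean
mean(ozone_filled)
sd(airquality$Ozone, na.rm = TRUE)
sd(ozone_filled)
```

The smaller standard deviation is a known side effect of mean imputation.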