Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
Missing data can be represented as NA, nan, 0, 99, ., ...
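Because placeholder codes such as 99 or . are not recognized as missing by R, they usually need to be converted to NA first. The following is a minimal sketch in base R; the vector raw_ozone and the choice of 99 and . as missing-value codes are hypothetical, and whether a value like 0 or 99 really means "missing" depends on how the dataset was recorded.

```r
# Hypothetical readings in which "99" and "." were used as missing-value codes
raw_ozone <- c("41", "36", "99", "18", ".", "28")

# Assumed placeholder codes for this illustration only
missing_codes <- c("99", ".")

# Replace the placeholder codes with NA, then convert to numeric
clean_ozone <- raw_ozone
clean_ozone[clean_ozone %in% missing_codes] <- NA
clean_ozone <- as.numeric(clean_ozone)

clean_ozone
sum(is.na(clean_ozone))
```

Once the codes are stored as NA, functions such as is.na() can detect them.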
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
is.na(airquality)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE
# Count missing values in the entire dataset
sum(is.na(airquality))
[1] 44
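The overall count does not show which columns the missing values come from. A short base R sketch of a per-column breakdown is given below; it assumes the built-in airquality dataset that ships with R, and the variable names are illustrative.

```r
# Count of missing values in each column of the built-in airquality data
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Proportion of missing values in each column
missing_share <- colMeans(is.na(airquality))
round(missing_share, 3)
```

The per-column counts sum to the same total that sum(is.na(airquality)) reports.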
library(visdat)
vis_miss(airquality)
library(dplyr)
airquality %>% mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
  miss_ozone Ozone Solar.R  Wind  Temp Month   Day
  <lgl>      <dbl>   <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE       31.5     207   9.7    65     7    16
2 TRUE        NA       194   9.7    99     6    15
airquality %>%
  arrange(Temp) %>%
  vis_miss()
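The same question (does missingness in Ozone depend on temperature?) can also be checked numerically. The sketch below uses base R on the built-in airquality dataset and splits days at the median temperature; the split point and variable names are assumptions made for illustration, and the built-in data may not show the same pattern as the course dataset.

```r
# Flag days that are hotter than the median temperature
temp_is_high <- airquality$Temp > median(airquality$Temp)

# Share of missing Ozone readings on cooler (FALSE) vs. hotter (TRUE) days
ozone_missing_share <- tapply(is.na(airquality$Ozone), temp_is_high, mean)
ozone_missing_share
```

A large gap between the two shares would suggest that the values are not missing completely at random.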
Simple approaches: drop rows with missing values, or fill in missing values with a statistic such as the mean
More complex approaches:
Learn more in Dealing with Missing Data in R
airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    41     190   7.4    67     5     1
2    36     118   8      72     5     2
3    12     149  12.6    74     5     3
4    18     313  11.5    62     5     4
5    23     299   8.6    65     5     7
6    19      99  13.8    59     5     8
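Dropping rows can also be done without dplyr. The following base R sketch uses complete.cases() on the two columns that contain missing values, again assuming the built-in airquality dataset; comparing row counts before and after shows how much data the drop costs.

```r
# TRUE for rows where both Ozone and Solar.R are present
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])

# Keep only those rows
airquality_complete <- airquality[keep, ]

# Row counts before and after dropping
nrow(airquality)
nrow(airquality_complete)
```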
airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
  Ozone Solar.R  Wind  Temp Month   Day ozone_filled
  <int>   <int> <dbl> <int> <int> <int>        <dbl>
1    41     190   7.4    67     5     1         41
2    36     118   8      72     5     2         36
3    12     149  12.6    74     5     3         12
4    18     313  11.5    62     5     4         18
5    NA      NA  14.3    56     5     5         42.1
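When filling in values, it can help to record which rows were imputed so they can be identified later. The sketch below is a base R variant of the mean imputation shown above, assuming the built-in airquality dataset; the column name ozone_was_missing is hypothetical.

```r
# Mean of the observed Ozone values
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)

airquality_filled <- airquality

# Record which rows were originally missing before filling them in
airquality_filled$ozone_was_missing <- is.na(airquality$Ozone)

# Replace missing Ozone values with the mean of the observed values
airquality_filled$ozone_filled <- replace(
  airquality$Ozone, is.na(airquality$Ozone), ozone_mean
)

head(airquality_filled)
```

Mean imputation leaves the column mean unchanged but reduces its variance, so the flag column makes it possible to exclude or inspect the filled rows in later analysis.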