Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) + geom_histogram(breaks = breaks)
library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)
Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].
There were 3 failures:
  Position Value    Cause
1        5   5.8 too high
2        8   6.2 too high
3        9  -4.4  too low
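The same range check can be sketched in base R without assertive. This is only an illustration, not the course's method: `avg_rating` is a toy vector holding the six ratings visible on the slide, and `out_of_range` and `bad` are names made up for the example.

```r
# Toy data: the six ratings shown on the slide (the rest of movies is elided there)
avg_rating <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3)

# TRUE wherever a rating falls outside the closed range [0, 5]
out_of_range <- avg_rating < 0 | avg_rating > 5

# Collect the position and value of each offending entry
bad <- data.frame(
  position = which(out_of_range),
  value    = avg_rating[out_of_range]
)
print(bad)

# Fail loudly, as an assertion would, if anything is out of range
if (any(out_of_range)) {
  message(sum(out_of_range), " value(s) outside [0, 5]")
}
```

Unlike an assertion, this version reports the problem and keeps running, which can be handy while exploring.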
movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))
replace(col, condition, replacement)
# Treat ratings above the maximum of 5 as missing
movies %>%
  mutate(rating_miss = replace(avg_rating, avg_rating > 5, NA))
  title                 rating_miss
  <chr>                       <dbl>
1 A Beautiful Mind              4.1
2 La Vita e Bella               4.3
3 Amelie                        4.2
4 Meet the Parents              3.5
5 Unbreakable                    NA
6 Gone in Sixty Seconds         3.3
...
# Cap ratings above the maximum of 5 at the range limit
movies %>%
  mutate(rating_const = replace(avg_rating, avg_rating > 5, 5))
  title                 rating_const
  <chr>                        <dbl>
1 A Beautiful Mind               4.1
2 La Vita e Bella                4.3
3 Amelie                         4.2
4 Meet the Parents               3.5
5 Unbreakable                    5.0
6 Gone in Sixty Seconds          3.3
...
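The replace() calls above only handle ratings that are too high, while the assertion also flagged a value that was too low. A minimal base R sketch of treating both ends of the range is given here. The vector `ratings` is toy data built from values that appear on the slides, and using pmin()/pmax() to clamp is a swapped-in alternative, not something the slides show.

```r
# Toy data: a few ratings from the slides, including the out-of-range ones
ratings <- c(4.1, 5.8, 6.2, -4.4, 3.3)

# Option 1: mark anything outside [0, 5] as missing
rating_miss <- replace(ratings, ratings < 0 | ratings > 5, NA)

# Option 2: clamp to the range limits
# pmax() raises values below 0 up to 0, pmin() lowers values above 5 down to 5
rating_const <- pmin(pmax(ratings, 0), 5)

print(rating_miss)
print(rating_const)
```

Which option fits depends on whether an out-of-range value is more likely a data entry error (missing is safer) or a genuine value recorded past the limit (clamping keeps the row).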
assert_all_are_in_past(movies$date_recorded)
Error: is_in_past : movies$date_recorded are not all in the past.
There was 1 failure:
  Position               Value     Cause
1        3 2064-09-22 20:00:00 in future
library(lubridate)
movies %>%
  filter(date_recorded > today())
  title  avg_rating date_recorded
1 Amelie        4.2 2064-09-23
library(lubridate)
movies <- movies %>%
  filter(date_recorded <= today())
library(assertive)
assert_all_are_in_past(movies$date_recorded)
Remember, no output = passed!
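For comparison, the date check and filter can be sketched in base R with Sys.Date() in place of lubridate's today(). The dates below are toy data: 2064-09-23 comes from the slide, the other date is made up for the example, and `in_future` and `kept` are hypothetical names.

```r
# Toy data: one plausible past date (made up) and the future date from the slide
date_recorded <- as.Date(c("2001-12-21", "2064-09-23"))

# TRUE for any date later than today
in_future <- date_recorded > Sys.Date()

# Keep only dates that are today or earlier
kept <- date_recorded[!in_future]

# Silent if every remaining date passes, error otherwise
stopifnot(all(kept <= Sys.Date()))
print(kept)
```

As with the assertive version, the final check produces no output when it passes.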