Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...
library(ggplot2)
# Bin edges: everything below 0, the valid 0-5 range, everything above 5
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) +
  geom_histogram(breaks = breaks)

library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)
Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].
There were 3 failures:
  Position Value    Cause
1        5   5.8 too high
2        8   6.2 too high
3        9  -4.4  too low
movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))

movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...
replace(col, condition, replacement)
movies %>%
  mutate(rating_miss = 
    replace(avg_rating, avg_rating > 5, NA))
  title                rating_miss
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                   NA
6 Gone in Sixty Seconds        3.3
...
movies %>%
  mutate(rating_const = 
           replace(avg_rating, avg_rating > 5, 5))
  title               rating_const
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.0
6 Gone in Sixty Seconds        3.3
...
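The two replace() calls above only handle ratings above 5, while the assert output also flagged a value that was too low. A minimal sketch of treating both ends of the range, written on a plain vector so it runs without extra packages. The vector is hypothetical: positions 5, 8 and 9 reuse the failing values from the assert output, and the 4.0 in position 7 is a made-up filler. The same expressions should also work inside mutate().

```r
# Hypothetical ratings; 5.8, 6.2 and -4.4 are the out-of-range values
# reported by the range assertion, 4.0 is an invented filler value
avg_rating <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 4.0, 6.2, -4.4)

# Option 1: treat anything outside [0, 5] as missing
rating_miss <- replace(avg_rating, avg_rating < 0 | avg_rating > 5, NA)

# Option 2: replace with the range limits
# pmax() raises values below 0 to 0, pmin() lowers values above 5 to 5
rating_const <- pmin(pmax(avg_rating, 0), 5)

print(rating_miss)
print(rating_const)
```

Whether to use NA or a range limit depends on the data; setting values to the limit assumes the out-of-range entries really were meant to be very high or very low ratings.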
assert_all_are_in_past(movies$date_recorded)
Error: is_in_past : movies$date_recorded are not all in the past.
There was 1 failure:
  Position               Value     Cause
1        3 2064-09-22 20:00:00 in future
library(lubridate)
movies %>%
  filter(date_recorded > today())
    title  avg_rating  date_recorded
1  Amelie         4.2  2064-09-23
library(lubridate)
movies <- movies %>%
  filter(date_recorded <= today())
library(assertive)
assert_all_are_in_past(movies$date_recorded)
Remember, no output = passed!
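If the assertive package is not available, a similar check can be sketched in base R. This is an assumption-laden stand-in, not the course's method: the dates are hypothetical, and Sys.Date() is used in place of lubridate's today().

```r
# Hypothetical recording dates; the third one lies in the future
date_recorded <- as.Date(c("2003-01-15", "2010-06-30", "2064-09-23"))

# Flag dates later than the current date
in_future <- date_recorded > Sys.Date()

# Positions of the offending values, similar to the Position column above
print(which(in_future))

# Keep only dates that are not in the future
kept <- date_recorded[!in_future]

# stopifnot() is silent when the condition holds, so no output = passed
stopifnot(all(kept <= Sys.Date()))
```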