Range constraints

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What's an out of range value?

  • SAT score: 400-1600
  • Package weight: at least 0 lb/kg
  • Adult resting heart rate: 60-100 beats per minute

Finding out of range values

movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...

Finding out of range values

library(ggplot2)
# Bin edges: below the range, the valid range 0-5, and above the range
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) + geom_histogram(breaks = breaks)

Histogram with three bins produced by the code: one value in the too-low bin, six values in range, and two values in the too-high bin.
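
The bin counts in the histogram can also be tallied without plotting. The sketch below is a base R illustration, not part of the original slides. The `ratings` vector is toy data: positions 1-6, 8 and 9 mirror values shown on the slides, while position 7 is a hypothetical in-range value, since the slides elide it.

```r
# Toy ratings; position 7 (4.0) is a made-up in-range value
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 4.0, 6.2, -4.4)

# Label each value relative to the closed range [0, 5]
status <- ifelse(ratings < 0, "too low",
                 ifelse(ratings > 5, "too high", "in range"))

# Fix the level order so the table reads low -> in range -> high
counts <- table(factor(status, levels = c("too low", "in range", "too high")))
print(counts)
```

Nested `ifelse()` is used here rather than `cut()` because `cut()` builds half-open intervals by default, which would misclassify a rating of exactly 0.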


Finding out of range values

library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)
Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].
There were 3 failures:
  Position Value    Cause
1        5   5.8 too high
2        8   6.2 too high
3        9  -4.4  too low
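
If `assertive` is not installed, a similar report can be assembled in base R. This is a rough sketch under that assumption, not the package's implementation. The `ratings` vector is toy data in which position 7 is hypothetical, and the `failures` data frame is an illustrative name.

```r
# Toy ratings; position 7 (4.0) is a made-up in-range value
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 4.0, 6.2, -4.4)

lower <- 0
upper <- 5

# TRUE where a value falls outside the closed range [lower, upper]
out_of_range <- ratings < lower | ratings > upper

# Position, value, and cause for each offending entry
failures <- data.frame(
  position = which(out_of_range),
  value    = ratings[out_of_range],
  cause    = ifelse(ratings[out_of_range] > upper, "too high", "too low")
)
print(failures)
```

Unlike the assertion, this does not stop execution; wrapping the check in `stopifnot(!any(out_of_range))` would make it fail loudly.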

Handling out of range values

  • Remove rows
  • Treat as missing (NA)
  • Replace with range limit
  • Replace with other value based on domain knowledge and/or knowledge of dataset

Removing rows

movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))

Histogram produced by the code: six values in range and none below or above the range limits.


Treat as missing

movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...

replace(col, condition, replacement)

movies %>%
  mutate(rating_miss = 
    replace(avg_rating, avg_rating > 5, NA))
  title                rating_miss
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                   NA
6 Gone in Sixty Seconds        3.3
...
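
The example above only catches ratings above 5. A possible variant that also treats values below 0 as missing is sketched here, assuming `dplyr` is available. The `movies_toy` data frame and its titles are hypothetical stand-ins for `movies`.

```r
library(dplyr)

# Hypothetical stand-in for the movies data
movies_toy <- data.frame(
  title      = c("In range", "Too high", "Too low"),
  avg_rating = c(4.1, 5.8, -4.4)
)

# between() is inclusive on both ends, so !between() flags
# anything outside the closed range [0, 5]
movies_toy <- movies_toy %>%
  mutate(rating_miss = replace(avg_rating, !between(avg_rating, 0, 5), NA))

print(movies_toy)
```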

Replacing out of range values

movies %>%
  mutate(rating_const = 
           replace(avg_rating, avg_rating > 5, 5))
  title               rating_const
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.0
6 Gone in Sixty Seconds        3.3
...
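
To replace values at both ends with the nearest range limit in one step, one option is base R's `pmax()` and `pmin()`. This is a minimal sketch on a toy vector, not the approach shown on the slide.

```r
# Toy ratings with one value above and one below the valid range
ratings <- c(4.1, 5.8, -4.4, 3.3)

# pmax() raises anything below 0 up to 0; pmin() lowers anything above 5 to 5
clamped <- pmin(pmax(ratings, 0), 5)

print(clamped)  # 4.1 5.0 0.0 3.3
```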

Date range constraints

assert_all_are_in_past(movies$date_recorded)
Error: is_in_past : movies$date_recorded are not all in the past.
There was 1 failure:
  Position               Value     Cause
1        3 2064-09-22 20:00:00 in future
library(lubridate)
movies %>%
  filter(date_recorded > today())
    title  avg_rating  date_recorded
1  Amelie         4.2  2064-09-23

Removing out-of-range dates

library(lubridate)
movies <- movies %>%
  filter(date_recorded <= today())
library(assertive)
assert_all_are_in_past(movies$date_recorded)


Remember, no output = passed!
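
The same filter-then-verify pattern can be sketched in base R with `Sys.Date()`, assuming the column holds `Date` values. The `dates` vector is toy data: only the 2064 date comes from the slides, and the other two are hypothetical.

```r
# Toy dates; the first and third are made-up past dates
dates <- as.Date(c("2001-12-21", "2064-09-23", "1997-12-20"))

today_date <- Sys.Date()

# Flag dates that lie after today
in_future <- dates > today_date

# Keep only dates up to and including today
kept <- dates[!in_future]

# Silent if every remaining date passes, error otherwise
stopifnot(all(kept <= today_date))
print(kept)
```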


Let's practice!
