Range constraints

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What are out-of-range values?

  • SAT score: 400–1600
  • Package weight: at least 0 lb/kg
  • Adult heart rate: 60–100 beats per minute

Finding out-of-range values

movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...

Finding out-of-range values

library(ggplot2)

# Bin edges: below the valid range, inside 0 to 5, above the valid range
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))

ggplot(movies, aes(avg_rating)) + geom_histogram(breaks = breaks)

Histogram with the three bins from the code: one value too low, six values in range, and two values too high.
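
The same three bins can also be counted without a plot. The following is a minimal base R sketch, not part of the original slides. It assumes a small illustrative vector `ratings` built only from the rating values visible in the slide output, so its counts differ from the full `movies` data. As with right-closed histogram bins, a rating of exactly 0 would land in the first bin, and `cut()` would fail if the minimum or maximum coincided with 0 or 5.

```r
# Illustrative ratings only: the values visible in the slides, not the full data
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 6.2, -4.4)

# Same bin edges as the histogram: minimum, 0, 5, maximum
breaks <- c(min(ratings), 0, 5, max(ratings))

# include.lowest = TRUE keeps the minimum itself inside the first bin
bins <- cut(ratings, breaks = breaks, include.lowest = TRUE,
            labels = c("too low", "in range", "too high"))

# Number of values per bin
bin_counts <- table(bins)
bin_counts
```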


Finding out-of-range values

library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)
Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].
There were 3 failures:
  Position Value    Cause
1        5   5.8 too high
2        8   6.2 too high
3        9  -4.4  too low
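
If the assertive package is not available, a similar failure report can be put together in base R. This is a rough sketch rather than how `assert_all_are_in_closed_range()` works internally. The vector `ratings` and the names `lower`, `upper`, `bad`, and `failures` are illustrative assumptions, and because the vector only holds the values visible in the slides, the positions differ from those in the error message above.

```r
# Illustrative ratings only: the values visible in the slides, not the full data
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 6.2, -4.4)
lower <- 0
upper <- 5

# Positions that fall outside the closed range from lower to upper
bad <- which(ratings < lower | ratings > upper)

# One row per failure, with the reason it failed
failures <- data.frame(
  Position = bad,
  Value    = ratings[bad],
  Cause    = ifelse(ratings[bad] > upper, "too high", "too low")
)
failures
```

To turn the check into a hard stop, `stopifnot(length(bad) == 0)` would raise an error whenever any value is out of range.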

Handling out-of-range values

  • Remove the rows
  • Treat as missing (NA)
  • Replace with the range limit
  • Replace with another value based on domain knowledge and/or knowledge of the dataset
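
The last option depends on the data at hand. As one possible illustration, the sketch below substitutes the median of the valid ratings for every out-of-range value. The median is an assumption chosen for the example, not a recommendation from the slides, and `ratings` again holds only the values visible in the slide output.

```r
# Illustrative ratings only: the values visible in the slides, not the full data
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 6.2, -4.4)

# TRUE for ratings inside the valid range 0 to 5
in_range <- ratings >= 0 & ratings <= 5

# Hypothetical choice: fill with the median of the valid ratings
fill_value <- median(ratings[in_range])

# Replace every out-of-range rating with that fill value
ratings_filled <- replace(ratings, !in_range, fill_value)
ratings_filled
```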

Removing rows

library(dplyr)
library(ggplot2)

movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))

Histogram produced by the code: six values in range and no values below or above the range limits.


Treating as missing

movies
  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...

replace(col, condition, replacement)

movies %>%
  mutate(rating_miss = 
    replace(avg_rating, avg_rating > 5, NA))
  title                rating_miss
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                   NA
6 Gone in Sixty Seconds        3.3
...
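
The slide's condition only catches ratings above 5. Since the data also contains a rating below 0, a two-sided condition may be closer to what is wanted. The sketch below shows this with base R's `replace()` on an illustrative vector of the values visible in the slides. The name `rating_miss` mirrors the slide, and everything else is an assumption made for the example.

```r
# Illustrative ratings only: the values visible in the slides, not the full data
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 6.2, -4.4)

# Mark anything below 0 or above 5 as missing
rating_miss <- replace(ratings, ratings < 0 | ratings > 5, NA)
rating_miss

# Summaries then need na.rm = TRUE to skip the missing values
mean_valid <- mean(rating_miss, na.rm = TRUE)
mean_valid
```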

Replacing out-of-range values

movies %>%
  mutate(rating_const = 
           replace(avg_rating, avg_rating > 5, 5))
  title               rating_const
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.0
6 Gone in Sixty Seconds        3.3
...
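
To replace values on both sides of the range with the nearest limit in one step, base R's `pmax()` and `pmin()` can be combined. This is a sketch of that alternative, not the approach shown on the slide. `ratings` is an illustrative vector of the values visible in the slides, and `rating_clamped` is a hypothetical name.

```r
# Illustrative ratings only: the values visible in the slides, not the full data
ratings <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 6.2, -4.4)

# pmax() raises anything below 0 up to 0; pmin() lowers anything above 5 to 5
rating_clamped <- pmin(pmax(ratings, 0), 5)
rating_clamped
```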

Date range constraints

assert_all_are_in_past(movies$date_recorded)
Error: is_in_past : movies$date_recorded are not all in the past.
There was 1 failure:
  Position               Value     Cause
1        3 2064-09-22 20:00:00 in future
library(lubridate)
movies %>%
  filter(date_recorded > today())
    title  avg_rating  date_recorded
1  Amelie         4.2  2064-09-23

Removing out-of-range dates

library(lubridate)
movies <- movies %>%
  filter(date_recorded <= today())
library(assertive)
assert_all_are_in_past(movies$date_recorded)


Remember, no output = passed!
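
The same filter-then-check pattern can be tried without lubridate or assertive, since base R's `Sys.Date()` also returns the current date. In this sketch only the future date 2064-09-23 comes from the slides. The other two dates are made up for illustration, and the names `date_recorded`, `in_future`, and `kept` are assumptions.

```r
# The first two dates are hypothetical; the third is the future date from the slides
date_recorded <- as.Date(c("2002-03-29", "1999-11-04", "2064-09-23"))

# Flag dates that lie after today
in_future <- date_recorded > Sys.Date()

# Keep only dates up to and including today
kept <- date_recorded[!in_future]

# Silent if every remaining date is in range, error otherwise
stopifnot(all(kept <= Sys.Date()))
kept
```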


Let's practice!
