Advanced aggregations

R'de data.table ile Veri İşleme

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

Recap

# Same example as seen before
## LHS := RHS Form
batrips[, c("is_dur_gt_1hour", "week_day") := 
            .(duration > 3600, wday(start_date)]
# Same as above, but in `:=`() functional form
batrips[, `:=`(is_dur_gt_1hour = duration > 3600,
               week_day = wday(start_date))]
# Update by reference with by
batrips[, n_zip_code := .N, by = zip_code]
R'de data.table ile Veri İşleme

Adding multiple columns by reference by group

# Functional form
batrips[, `:=`(end_dur_first = duration[1], 
               end_dur_last  = duration[.N]),
        by = end_station]

# LHS := RHS form batrips[, c("end_dur_first", "end_dur_last") := list(duration[1], duration[.N]), by = end_station]
batrips[1:5]
   trip_id duration ...        end_station ... end_dur_first end_dur_last 
    139545      435 ...    Townsend at 7th ...           435          660
    139546      432 ...    Townsend at 7th ...           435          660
    139547     1523 ...    Beale at Market ...          1523          229
    139549     1620 ... Powell Street BART ...          1620          540
    139550     1617 ... Powell Street BART ...          1620          540
R'de data.table ile Veri İşleme

Binning values

For each unique combination of start_station and end_station, if median duration:

  • less than 600, "short"
  • between 600 and 1800, "medium"
  • "long", otherwise
R'de data.table ile Veri İşleme

Multi-line expressions in j

batrips[, trip_category := {
             med_dur = median(duration, na.rm = TRUE)
             if (med_dur < 600) "short"
             else if (med_dur >= 600 & med_dur <= 1800) "medium"
             else "long"
           },
        by = .(start_station, end_station)]
batrips[1:3]
trip_id duration ... zip_code trip_category
 139545      435 ...    94612         short
 139546      432 ...    94107         short
 139547     1523 ...    94112         short
R'de data.table ile Veri İşleme

Alternative way

bin_median_duration <- function(dur) {
  med_dur <- median(dur, na.rm = TRUE)
  if (med_dur < 600) "short"
  else if (med_dur >= 600 & med_dur <= 1800) "medium"
  else "long"
}

batrips[, trip_category := bin_median_duration(duration), 
           by = .(start_station, end_station)]
R'de data.table ile Veri İşleme

All together - i, j and by

batrips[duration > 500, min_dur_gt_500 := min(duration), 
           by = .(start_station, end_station)]
batrips[1:3]
trip_id duration ... zip_code min_dur_gt_500
 139545      435 ...    94612             NA
 139546      432 ...    94107             NA
 139547     1523 ...    94112            502
R'de data.table ile Veri İşleme

Let's practice!

R'de data.table ile Veri İşleme

Preparing Video For Download...