Filtering rows in a data.table

Data Manipulation with data.table in R

Matt Dowle and Arun Srinivasan

Instructors, DataCamp

General form of data.table syntax

First argument i is used to subset or filter rows

# General form of data.table syntax
DT[i, j, by]
   |  |  |
   |  |  --> grouped by what?
   |  -----> what to do?
   --------> on which rows?
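
As a minimal sketch of how the three arguments work together, the example below builds a small hypothetical table called trips (its column names only mimic batrips; the values are made up) and uses i, j and by in a single call.

```r
library(data.table)

# Hypothetical toy table; the values are invented for illustration
trips <- data.table(
  trip_id  = 1:6,
  duration = c(300, 450, 120, 600, 240, 360),
  subscription_type = c("Subscriber", "Customer", "Subscriber",
                        "Subscriber", "Customer", "Subscriber")
)

# i:  keep only rows with duration > 200
# j:  compute the mean duration of those rows
# by: return one result per subscription_type
res <- trips[duration > 200,
             .(mean_duration = mean(duration)),
             by = subscription_type]
print(res)
```

Leaving j and by empty, as in trips[duration > 200], returns the filtered rows unchanged.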

Row numbers

# Subset 3rd and 4th rows from batrips
batrips[3:4]

# Same as
batrips[3:4, ]
# Subset everything except first five rows
batrips[-(1:5)] 

# Same as
batrips[!(1:5)]
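
The same row-number patterns can be tried on a small table. This is a sketch on a hypothetical eight-row table named dt, not on batrips itself.

```r
library(data.table)

# Hypothetical eight-row table for illustration
dt <- data.table(id = 1:8, value = letters[1:8])

# Keep the 3rd and 4th rows
rows_3_4 <- dt[3:4]

# Drop the first five rows; the two forms select the same rows
drop_neg <- dt[-(1:5)]
drop_not <- dt[!(1:5)]

print(rows_3_4)
print(drop_neg)
```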

Special symbol .N

  • .N is an integer value that contains the number of rows in the data.table
  • Useful alternative to nrow(x) in i
nrow(batrips) 
326339
batrips[326339]
trip_id duration
588914      364
# Returns the last row
batrips[.N] 
trip_id duration
588914      364
# Return all but the last 10 rows
ans <- batrips[1:(.N-10)] 
nrow(ans)
326329
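
A small sketch of .N in i, using a hypothetical 25-row table named dt so the results are easy to check by eye:

```r
library(data.table)

# Hypothetical 25-row table for illustration
dt <- data.table(id = 1:25)

# .N evaluates to the number of rows (25 here), so this is the last row
last_row <- dt[.N]

# All but the last 10 rows
all_but_last_10 <- dt[1:(.N - 10)]

# The last three rows
last_three <- dt[(.N - 2):.N]

print(last_row)
print(nrow(all_but_last_10))
print(last_three)
```

Note that 1:(.N - 10) assumes the table has more than 10 rows; on a shorter table the sequence would count downwards and select unintended rows.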

Logical expressions (I)

# Subset rows where subscription_type is "Subscriber"
batrips[subscription_type == "Subscriber"]

# If batrips was only a data frame
batrips[batrips$subscription_type == "Subscriber", ]

Logical expressions (II)

# Subset rows where start_terminal is 58 and end_terminal is not 65
batrips[start_terminal == 58 & end_terminal != 65]

# If batrips was only a data frame
batrips[batrips$start_terminal == 58 & batrips$end_terminal != 65, ]
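
To compare the two styles side by side, the sketch below applies the same filter to a hypothetical five-row table named trips (made-up values) and to its data frame copy. In the data.table form, column names are used directly in i; the data frame form needs the df$ prefix and the trailing comma.

```r
library(data.table)

# Hypothetical toy table; the values are invented for illustration
trips <- data.table(
  trip_id        = 1:5,
  start_terminal = c(58, 58, 60, 58, 70),
  end_terminal   = c(65, 70, 65, 58, 65)
)

# data.table: columns are visible inside [ ], no comma required
dt_res <- trips[start_terminal == 58 & end_terminal != 65]

# data frame: prefix each column and keep the trailing comma
df <- as.data.frame(trips)
df_res <- df[df$start_terminal == 58 & df$end_terminal != 65, ]

print(dt_res)
print(df_res)
```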

Logical expressions (III)

Subsets using == or %in% are automatically optimized with secondary indices for speed

set.seed(1)
dt <- data.table(x = sample(10000, 10e6, TRUE), 
                 y = sample(letters, 10e6, TRUE))
indices(dt)
NULL
# 0.207s on first run 
#(time to create index + subset)
system.time(dt[x == 900])
user  system elapsed 
0.207   0.015   0.226 
indices(dt)
"x"
# 0.002s on subsequent runs
#(instant subset using index)
system.time(dt[x == 900])
user  system elapsed 
0.002   0.000   0.002
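
The timing experiment can be repeated at a smaller scale. The sketch below assumes data.table's default options (automatic indexing enabled) and uses a one-million-row table so that it runs quickly. Timings depend on the machine, so none are shown.

```r
library(data.table)

set.seed(1)
# Smaller than the table on the slide so the sketch runs quickly
dt <- data.table(x = sample(10000, 1e6, TRUE),
                 y = sample(letters, 1e6, TRUE))

# A freshly created table has no secondary index
before <- indices(dt)

# First subset: with default options this may also build an index on x
first_run <- system.time(res <- dt[x == 900])

# Inspect the indices again, then repeat the same subset
after <- indices(dt)
second_run <- system.time(res2 <- dt[x == 900])

print(before)
print(after)
print(first_run)
print(second_run)
```

Both runs return the same rows; only the time taken is expected to differ.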

Let's practice!
