Data Manipulation with data.table in R
Matt Dowle and Arun Srinivasan
Instructors, DataCamp
First argument i
is used to subset or filter rows
# General form of data.table syntax
DT[i, j, by]
   |  |  |
   |  |  --> grouped by what?
   |  -----> what to do?
   --------> on which rows?
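To make the general form concrete, here is a minimal sketch that uses all three arguments at once. The table `trips` and its columns are hypothetical stand-ins for batrips, not the course data.

```r
library(data.table)

# Hypothetical toy table standing in for batrips
trips <- data.table(
  trip_id  = 1:6,
  duration = c(300, 120, 450, 200, 90, 600),
  type     = c("Subscriber", "Customer", "Subscriber",
               "Customer", "Subscriber", "Customer")
)

# i:  keep rows with duration > 100
# j:  compute the mean duration
# by: do so separately for each type
res <- trips[duration > 100, .(mean_duration = mean(duration)), by = type]
print(res)
```

Reading the call left to right mirrors the diagram: on which rows, what to do, grouped by what.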
# Subset 3rd and 4th rows from batrips
batrips[3:4]
# Same as
batrips[3:4, ]
# Subset everything except first five rows
batrips[-(1:5)]
# Same as
batrips[!(1:5)]
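The same row-number subsets can be tried on a small table where the result is easy to check by eye. The table `toy` below is a hypothetical example, not part of batrips.

```r
library(data.table)

# Hypothetical 8-row table
toy <- data.table(id = 1:8, value = letters[1:8])

# Rows 3 and 4; the trailing comma is optional in data.table
rows_3_4 <- toy[3:4]

# Everything except the first five rows, written two ways
drop_minus <- toy[-(1:5)]
drop_bang  <- toy[!(1:5)]

print(rows_3_4$id)    # 3 4
print(drop_minus$id)  # 6 7 8
print(drop_bang$id)   # 6 7 8
```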
.N
is an integer value that contains the number of rows in the data.table
can be used in i in place of nrow(x)
nrow(batrips)
326339
batrips[326339]
trip_id duration
588914 364
# Returns the last row
batrips[.N]
trip_id duration
588914 364
# Return all but the last 10 rows
ans <- batrips[1:(.N-10)]
nrow(ans)
326329
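A small sketch of .N on a hypothetical 20-row table, where the expected row counts are easy to verify. The name `toy` is an assumption for illustration.

```r
library(data.table)

# Hypothetical 20-row table
toy <- data.table(id = 1:20)

# .N inside i evaluates to the number of rows, so this is the last row
last_row <- toy[.N]
print(last_row$id)           # 20

# Same result as using nrow() explicitly
same_row <- toy[nrow(toy)]

# All but the last 3 rows
all_but_last3 <- toy[1:(.N - 3)]
print(nrow(all_but_last3))   # 17

# head() with a negative n gives the same rows
via_head <- head(toy, -3)
```

Note that 1:(.N - 10) counts downward when the table has 10 or fewer rows, so it does not return an empty table in that case; head(DT, -10) may be the safer choice for tables of unknown size.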
# Subset rows where subscription_type is "Subscriber"
batrips[subscription_type == "Subscriber"]
# If batrips was only a data frame
batrips[batrips$subscription_type == "Subscriber", ]
# Subset rows where start_terminal = 58 and end_terminal is not 65
batrips[start_terminal == 58 & end_terminal != 65]
# If batrips was only a data frame
batrips[batrips$start_terminal == 58 & batrips$end_terminal != 65, ]
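The two styles can be compared side by side on a small table. The sketch below uses hypothetical data with the same column names as batrips; it is an illustration, not the course dataset.

```r
library(data.table)

# Hypothetical 4-row table with batrips-like columns
trips_dt <- data.table(
  start_terminal    = c(58, 58, 60, 58),
  end_terminal      = c(65, 70, 65, 61),
  subscription_type = c("Subscriber", "Customer", "Subscriber", "Subscriber")
)
trips_df <- as.data.frame(trips_dt)

# data.table: columns are visible inside [ ], and no trailing comma is needed
res_dt <- trips_dt[start_terminal == 58 & end_terminal != 65]

# data.frame: columns need the trips_df$ prefix, and the comma is required
res_df <- trips_df[trips_df$start_terminal == 58 & trips_df$end_terminal != 65, ]

print(res_dt$end_terminal)  # 70 61
print(res_df$end_terminal)  # 70 61
```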
Subsets in i are automatically optimized for speed using secondary indices
set.seed(1)
dt <- data.table(x = sample(10000, 10e6, TRUE),
                 y = sample(letters, 1e6, TRUE))
indices(dt)
NULL
# 0.207s on first run
#(time to create index + subset)
system.time(dt[x == 900])
user system elapsed
0.207 0.015 0.226
indices(dt)
"x"
# 0.002s on subsequent runs
#(instant subset using index)
system.time(dt[x == 900])
user system elapsed
0.002 0.000 0.002
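The automatic behaviour above can also be inspected, or triggered explicitly, on a smaller table. This is a minimal sketch: the table `small` is hypothetical, and the automatic index is only expected under data.table's default options.

```r
library(data.table)

set.seed(1)
# Hypothetical table, smaller than the one above so it runs quickly
small <- data.table(x = sample(100, 1e5, TRUE),
                    y = sample(letters, 1e5, TRUE))

# No index exists yet
print(indices(small))

# With default options, an == subset in i is expected to create an index on x
first_subset <- small[x == 50]
print(indices(small))

# An index can also be created explicitly, without reordering the rows
setindex(small, y)
print(indices(small))
```

setindex() is useful when you know in advance which column will be filtered repeatedly and do not want the first subset to pay the index-building cost.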