Data Manipulation with data.table in R
Matt Dowle, Arun Srinivasan
Instructors, DataCamp
The second argument, j, is used to select (and compute on) columns
# General form of data.table syntax
DT[i, j, by]
   |  |  |
   |  |  --> grouped by what?
   |  -----> what to do?
   --------> on which rows?
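As a minimal sketch of how the three arguments fit together, the snippet below uses a small made-up table named `trips` (hypothetical, standing in for `batrips`; the column `station` and the values are invented for illustration).

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id  = 1:6,
  duration = c(300, 4000, 450, 5000, 200, 3700),
  station  = c("A", "A", "B", "B", "C", "C")
)

# i:  keep rows where duration exceeds one hour
# j:  compute the mean duration of those rows
# by: do the computation separately for each station
long_means <- trips[duration > 3600,
                    list(mean_dur = mean(duration)),
                    by = station]
print(long_means)
```

Reading the call left to right mirrors the diagram: on which rows, what to do, grouped by what.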
The j argument accepts a character vector of column names
ans <- batrips[, c("trip_id", "duration")]
head(ans, 2)
trip_id duration
139545 435
139546 432
batrips_df <- as.data.frame(batrips)
ans <- batrips_df[, "trip_id"]
head(ans, 2)
# The result is a vector, not a data.frame
139545 139546
ans <- batrips[, "trip_id"]
# Still a data.table, not a vector
head(ans, 2)
trip_id
139545
139546
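The difference between the two classes can be checked directly. The sketch below assumes a small hypothetical table `trips` in place of `batrips`. It also shows base R's `drop = FALSE`, which is how a data.frame can be asked to keep its shape.

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id  = c(101L, 102L, 103L),
  duration = c(435L, 432L, 1523L)
)
trips_df <- as.data.frame(trips)

from_df      <- trips_df[, "trip_id"]               # simplified to a vector
from_df_kept <- trips_df[, "trip_id", drop = FALSE] # stays a data.frame
from_dt      <- trips[, "trip_id"]                  # stays a data.table

print(class(from_df))
print(class(from_df_kept))
print(class(from_dt))
```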
Column numbers instead of names work just fine
ans <- batrips[, c(2, 4)]
head(ans, 2)
duration start_station
435 San Francisco City Hall
432 San Francisco City Hall
However, we consider this a bad practice
# If the order of columns changes, the result is wrong
batrips[, c(2, 4)]
# The result is always correct, no matter the order
batrips[, c("duration", "start_station")]
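To make the risk concrete, the sketch below reorders the columns of a small hypothetical table (`trips`, invented for illustration) with `setcolorder()` and compares selection by position with selection by name. `copy()` is used so the original table is not modified by reference.

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id       = 1:2,
  duration      = c(435L, 432L),
  start_date    = c("2014-01-01", "2014-01-01"),
  start_station = c("City Hall", "City Hall")
)

# Positions 2 and 4 are duration and start_station in the original order
by_pos_before <- names(trips[, c(2, 4)])

# Reorder the columns on a copy
reordered <- copy(trips)
setcolorder(reordered, c("start_station", "trip_id", "duration", "start_date"))

# The same positions now pick different columns
by_pos_after  <- names(reordered[, c(2, 4)])

# Names still pick the intended columns
by_name_after <- names(reordered[, c("duration", "start_station")])

print(by_pos_before)
print(by_pos_after)
print(by_name_after)
```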
-c("col1", "col2", ...) deselects the specified columns
! instead of - works the same way
# Select all cols *except* those shown below
ans <- batrips[, -c("start_date", "end_date", "end_station")]
head(ans, 1)
trip_id duration start_station start_terminal bike_id end_terminal
139545 435 San Francisco City Hall 58 65 473
subscription_type zip_code
Subscriber 94612
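A quick way to confirm that `-` and `!` behave alike is to run both on the same table. The sketch below assumes a small hypothetical table `trips` rather than the real `batrips`.

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id    = 1:2,
  duration   = c(435L, 432L),
  start_date = c("2014-01-01", "2014-01-01"),
  end_date   = c("2014-01-01", "2014-01-02")
)

# Deselect with a leading minus
with_minus <- trips[, -c("start_date", "end_date")]

# Deselect with a leading bang
with_bang  <- trips[, !c("start_date", "end_date")]

print(names(with_minus))
print(names(with_bang))
```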
Remember how columns were used as if they were variables in the i argument in the last chapter?
# Recap the "i" argument
# All trips more than an hour
batrips[duration > 3600]
Similarly, you can use a list of variables (column names) to select columns
ans <- batrips[, list(trip_id, dur = duration)]
head(ans, 2)
trip_id dur
139545 435
139546 432
When selecting a single column, not wrapping the variable in list() returns a vector
# Select a single column and return a data.table
ans <- batrips[, list(trip_id)]
head(ans, 2)
trip_id
139545
139546
# Select a single column and return a vector
ans <- batrips[, trip_id]
head(ans, 2)
139545 139546
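The return types can be verified with a short check. This is a sketch on a hypothetical toy table `trips`, not the real `batrips`. It also repeats the renaming form `dur = duration` shown earlier.

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id  = c(101L, 102L, 103L),
  duration = c(435L, 432L, 1523L)
)

as_table  <- trips[, list(trip_id)]                 # one-column data.table
as_vector <- trips[, trip_id]                       # plain integer vector
renamed   <- trips[, list(trip_id, dur = duration)] # select and rename

print(class(as_table))
print(class(as_vector))
print(names(renamed))
```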
.() is an alias for list(), for convenience
# .() is the same as list()
ans <- batrips[, .(trip_id, duration)]
head(ans, 2)
trip_id duration
139545 435
139546 432
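As a small sanity check, the two spellings can be compared on the same input. The sketch below uses a hypothetical toy table `trips`. The last line illustrates the "compute on" part of j with a made-up summary column name, `mean_dur`.

```r
library(data.table)

# Hypothetical toy data; not the real batrips dataset
trips <- data.table(
  trip_id  = c(101L, 102L, 103L),
  duration = c(435L, 432L, 1523L)
)

with_dot  <- trips[, .(trip_id, dur = duration)]
with_list <- trips[, list(trip_id, dur = duration)]

# Both spellings should produce the same table
same_result <- isTRUE(all.equal(with_dot, with_list))
print(same_result)

# j can also compute: one-row data.table holding the mean duration
summary_dt <- trips[, .(mean_dur = mean(duration))]
print(summary_dt)
```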