Selecting columns from a data.table

Data Manipulation with data.table in R

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

General form of data.table syntax (Recap)

The second argument, j, is used to select (and compute on) columns

# General form of data.table syntax
DT[i, j, by]
   |  |  |
   |  |  --> grouped by what?
   |  -----> what to do?
   --------> on which rows?
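
As a minimal sketch of the general form, the example below uses all three arguments at once. The `trips` table and its values are a hypothetical stand-in for batrips, made up for illustration only.

```r
library(data.table)

# Hypothetical toy data standing in for batrips (values are illustrative)
trips <- data.table(
  trip_id = 1:5,
  duration = c(435, 432, 4000, 5000, 4200),
  subscription_type = c("Subscriber", "Subscriber", "Customer",
                        "Subscriber", "Subscriber")
)

# i:  keep rows where the trip lasted more than an hour
# j:  compute the mean duration of those rows
# by: produce one result row per subscription type
ans <- trips[duration > 3600,
             list(mean_dur = mean(duration)),
             by = subscription_type]
print(ans)
```

Leaving an argument empty skips that step, which is why selecting columns is written as `DT[, j]` with nothing before the first comma.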

Using column names to select columns

The j argument accepts a character vector of column names

ans <- batrips[, c("trip_id", "duration")]
head(ans, 2)
trip_id duration
139545      435
139546      432

Using column names to select columns

batrips_df <- as.data.frame(batrips)
ans <- batrips_df[, "trip_id"]
head(ans, 2)
# The result is a vector, not a data.frame
139545 139546
ans <- batrips[, "trip_id"]
# Still a data.table, not a vector
head(ans, 2)
trip_id
139545
139546
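
The difference between the two results can be checked directly. This is a small sketch on a hypothetical two-row table (`trips`, not the real batrips) that reuses the trip IDs shown above.

```r
library(data.table)

# Hypothetical two-row stand-in for batrips
trips <- data.table(trip_id = c(139545L, 139546L), duration = c(435L, 432L))
trips_df <- as.data.frame(trips)

# data.frame: a single column name drops the result to a plain vector
vec <- trips_df[, "trip_id"]

# data.table: the same call keeps a one-column data.table
dt <- trips[, "trip_id"]

class(vec)
class(dt)
```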

Column numbers instead of names work just fine

ans <- batrips[, c(2, 4)]
head(ans, 2)
duration  start_station
435       San Francisco City Hall
432       San Francisco City Hall

However, we consider this bad practice

# If the column order changes, this silently selects different columns
batrips[, c(2, 4)]

# Selecting by name returns the same columns, whatever their order
batrips[, c("duration", "start_station")]
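
To see why column numbers are fragile, the sketch below reorders the columns of a hypothetical three-column table (`trips`, a stand-in for batrips) and compares selection by number with selection by name.

```r
library(data.table)

# Hypothetical stand-in for batrips with three columns
trips <- data.table(
  trip_id = c(139545L, 139546L),
  duration = c(435L, 432L),
  start_station = c("San Francisco City Hall", "San Francisco City Hall")
)

# Columns 1 and 2 before reordering
by_number_before <- names(trips[, c(1, 2)])

# Reorder the columns on a copy, leaving the original table untouched
reordered <- setcolorder(copy(trips), c("start_station", "trip_id", "duration"))

# The same numbers now pick different columns; the names still pick the same ones
by_number_after <- names(reordered[, c(1, 2)])
by_name_after <- names(reordered[, c("trip_id", "duration")])

by_number_before
by_number_after
by_name_after
```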

Deselecting columns with character vectors

  • -c("col1", "col2", ...) deselects the specified columns
  • Negating a character vector of names is a data.table convenience; the same call on a data.frame raises an error
  • Using ! instead of - works the same way
# Select all cols *except* those shown below
ans <- batrips[, -c("start_date", "end_date", "end_station")]
head(ans, 1)
trip_id  duration  start_station             start_terminal  bike_id  end_terminal
139545   435       San Francisco City Hall   58              65       473 

subscription_type  zip_code
Subscriber         94612
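
A short sketch of deselection on a hypothetical four-column table (`trips`; the date strings are made up for illustration). It checks that - and ! drop the same columns, and that the equivalent call on a data.frame fails.

```r
library(data.table)

# Hypothetical stand-in for batrips; the dates are illustrative strings
trips <- data.table(
  trip_id = c(139545L, 139546L),
  duration = c(435L, 432L),
  start_date = c("2014-01-01", "2014-01-01"),
  end_date = c("2014-01-01", "2014-01-01")
)

# Both forms keep every column except the ones named
minus <- trips[, -c("start_date", "end_date")]
bang <- trips[, !c("start_date", "end_date")]
names(minus)
names(bang)

# On a data.frame, negating a character vector is an error
trips_df <- as.data.frame(trips)
df_fails <- tryCatch({
  trips_df[, -c("start_date", "end_date")]
  FALSE
}, error = function(e) TRUE)
df_fails
```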

Selecting columns the data.table way

Remember how, in the last chapter, columns were used as if they were variables in the i argument?

# Recap the "i" argument
# All trips more than an hour
batrips[duration > 3600]

Similarly, you can pass a list of variables (column names) in j to select columns, optionally renaming them (here, duration becomes dur)

ans <- batrips[, list(trip_id, dur = duration)]
head(ans, 2)
trip_id     dur
139545      435
139546      432

When selecting a single column, omitting list() around the variable returns a vector

# Select a single column and return a data.table
ans <- batrips[, list(trip_id)]
head(ans, 2)
trip_id
139545
139546
# Select a single column and return a vector
ans <- batrips[, trip_id]
head(ans, 2)
139545 139546
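
The two return types can be confirmed with a quick check. This sketch assumes a hypothetical two-row table named `trips` in place of batrips.

```r
library(data.table)

# Hypothetical two-row stand-in for batrips
trips <- data.table(trip_id = c(139545L, 139546L), duration = c(435L, 432L))

# Wrapped in list(): the result is a one-column data.table
as_table <- trips[, list(trip_id)]

# Bare column name: the result is the column itself, as a vector
as_vector <- trips[, trip_id]

is.data.table(as_table)
is.data.table(as_vector)
```

The vector form is handy when the values feed straight into another function, such as `mean()` or `unique()`.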

Selecting columns the data.table way

.() is an alias for list(), for convenience

# .() is the same as list()
ans <- batrips[, .(trip_id, duration)]
head(ans, 2)
trip_id duration
139545      435
139546      432
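
Because j can also compute on columns, .() can hold expressions as well as plain column names. The sketch below assumes a hypothetical two-row `trips` table, and the `dur_min` column name is made up for illustration.

```r
library(data.table)

# Hypothetical two-row stand-in for batrips
trips <- data.table(trip_id = c(139545L, 139546L), duration = c(435, 432))

# Select trip_id and compute the duration in minutes, using list()
with_list <- trips[, list(trip_id, dur_min = duration / 60)]

# The same call written with the .() alias
with_dot <- trips[, .(trip_id, dur_min = duration / 60)]

print(with_dot)
isTRUE(all.equal(with_list, with_dot))
```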

Let's practice!
