Adding and updating columns by reference

Data Manipulation with data.table in R

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

data.frame internals

Let's say we would like to change the 2nd row of column "y" to 10

df <- data.frame(x = 1:5, y = 6:10)
df
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
df$y[2] <- 10

data.frame internals

In R versions before v3.1.0, this operation deep copied the entire data.frame

# what happens internally prior to R v3.1.0 (conceptual sketch)
tmp <- df  # stands in for a deep copy of "df"
tmp$y[2] <- 10
df <- tmp
  • What happens if you do the same operation on a 10 GB data.frame?

data.frame internals

  • In R v3.1.0, improvements were made so that only the columns being updated are deep copied

  • In this case, just columns a and b are deep copied in the operation performed on df below

df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
df[1:2] <- lapply(df[1:2], function(x) ifelse(x%%2, x, NA))
df
 a  b c  d
 1 NA 7 10
NA  5 8 11
 3 NA 9 12
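
One way to check the claim above is to compare the memory address of an untouched column before and after the update. This is a minimal sketch, assuming R >= 3.1.0 and that the data.table package is installed (it is used here only for its address() helper); the variable names addr_c_before and addr_c_after are made up for illustration.

```r
library(data.table)  # used here only for address()

df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Record where column c lives before the update
addr_c_before <- address(df$c)

# Same update as on the slide: only columns a and b are replaced
df[1:2] <- lapply(df[1:2], function(x) ifelse(x %% 2, x, NA))

# Column c was not part of the update, so it should still be the same vector
addr_c_after <- address(df$c)
identical(addr_c_before, addr_c_after)
```

On R >= 3.1.0 the comparison is expected to return TRUE, because the columns that were not updated are shared rather than copied.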

data.table internals

  • data.table updates columns in place, i.e., by reference

  • This means you don't need to assign the result back to a variable

  • No copy of any column is made while their values are changed

  • data.table uses a new operator := to add/update/delete columns by reference
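
The points above can be sketched on a small table. This is an illustrative example, not course material: dt is a hypothetical toy data.table that mirrors the earlier data.frame, and address() from data.table is used to check that the column is still the same object after the update.

```r
library(data.table)

# Hypothetical toy table mirroring the earlier data.frame example
dt <- data.table(x = 1:5, y = 6:10)
addr_y_before <- address(dt$y)

# Update the 2nd row of column y in place; the result is not assigned back to dt
dt[2, y := 10L]

addr_y_after <- address(dt$y)
identical(addr_y_before, addr_y_after)  # same vector, so no copy was made
dt$y                                    # 6 10 8 9 10
```

Note that 10L is used rather than 10 so that the new value matches the integer type of column y.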

LHS := RHS form

batrips[, c("is_dur_gt_1hour", "week_day") := list(duration > 3600, 
                                                   wday(start_date))]

# When adding a single column quotes aren't necessary
batrips[, is_dur_gt_1hour := duration > 3600]
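
To try the LHS := RHS form without the course data, here is a minimal sketch on a made-up stand-in for batrips. The table name trips and its two rows are assumptions for illustration; only the column names duration and start_date come from the slide.

```r
library(data.table)

# Hypothetical stand-in for batrips with just the columns used on the slide
trips <- data.table(
  duration   = c(4200L, 540L),
  start_date = as.POSIXct(c("2014-01-01 00:14:00", "2014-01-02 09:30:00"),
                          tz = "UTC")
)

# Add two columns at once: a character vector of names on the LHS,
# a list with one element per new column on the RHS
trips[, c("is_dur_gt_1hour", "week_day") := list(duration > 3600,
                                                 wday(start_date))]

names(trips)
trips$is_dur_gt_1hour  # TRUE FALSE
```

The list on the RHS must supply one element per name on the LHS, in the same order.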

Functional form

batrips[, `:=`(is_dur_gt_1hour = NULL,                  
               start_station = toupper(start_station))] 
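
The functional form can be tried the same way. This is a sketch under assumptions: trips is a hypothetical toy table and the station names are invented, so that the example runs without the batrips data.

```r
library(data.table)

# Hypothetical toy table; station names are made up for illustration
trips <- data.table(
  start_station = c("Townsend at 7th", "Market at 4th"),
  duration      = c(4200L, 540L)
)
trips[, is_dur_gt_1hour := duration > 3600]

# Functional form: assigning NULL deletes a column, any other value updates it
trips[, `:=`(is_dur_gt_1hour = NULL,
             start_station = toupper(start_station))]

names(trips)         # "start_station" "duration"
trips$start_station  # "TOWNSEND AT 7TH" "MARKET AT 4TH"
```

Each argument inside `:=`() is a name = value pair, which lets one call mix deletions and updates.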

Let's practice!
