Incorporating joins into your data.table workflow

Joining Data with data.table in R

Scott Ritchie

Postdoctoral Researcher in Systems Genomics

Chaining data.table expressions

data.table expressions can be chained in sequence:

demographics[...][...]

General form of chaining a join:

DT1[DT2, on][i, j, by]
    |    |   |  |  |
    |    |   |  |   --> grouped by what?
    |    |   |   -----> what to do?
    |    |    --------> on which rows?
    |     ------------> join key columns
     -----------------> join to which data.table?
Joining Data with data.table in R

Join then compute

customers <- data.table(name   = c("Mark", "Matt", "Angela", "Michelle"), 
                        gender = c("M", "M", "F", "F"), 
                        age    = c(54, 43, 39, 63))
customers
       name gender age
1:     Mark      M  54
2:     Matt      M  43
3:   Angela      F  39
4: Michelle      F  63
Joining Data with data.table in R

Join then compute

purchases <- data.table(name  = c("Mark", "Matt", "Angela", "Michelle"),
                        sales = c(1, 5, 4, 3),
                        spent = c(41.70, 41.78, 50.77, 60.01))
purchases
       name sales spent
1:     Mark     1 41.70
2:     Matt     5 41.78
3:   Angela     4 50.77
4: Michelle     3 60.01
Joining Data with data.table in R

Join then compute

customers[purchases, 
          on = .(name)][sales > 1, 
                        j = .(avg_spent = sum(spent) / sum(sales)), 
                        by = .(gender)] 
   gender avg_spent
1:      M  13.91333
2:      F  20.00333
Joining Data with data.table in R

Computation with joins

Computation with joins:

DT1[DT2, on, j]
    |    |   | 
    |    |    ----> what to do on the join result?
    |     --------> using which columns as keys? 
     -------------> join to which data.table?

Efficient for large data.tables!

Joining Data with data.table in R

Joining and column creation

Column creation takes place in the main data.table:

customers[purchases, on = .(name), return_customer := sales > 1]
customers
       name gender age return_customer
1:     Mark      M  54           FALSE
2:     Matt      M  43            TRUE
3:   Angela      F  39            TRUE
4: Michelle      F  63            TRUE
Joining Data with data.table in R

Grouping by matches

by = .EACHI groups j by each row from DT2

DT1[DT2, on, j, by = .EACHI]
    |    |   |  |
    |    |   |   --> grouped by each match in DT1.
    |    |    -----> what to do on the join result?
    |     ---------> using which columns as keys? 
     --------------> join to which data.table?
Joining Data with data.table in R

Grouping by matches

shipping[customers, on = .(name), 
         j = .("# of shipping addresses" = .N),
         by = .EACHI]

Joining Data with data.table in R

Grouping by columns with joins

Grouping by columns in the by restricts computation to the main data.table:

DT1[DT2, on, j, by]
    |    |   |  |
    |    |   |   --> grouped by what columns in DT1?
    |    |    -----> what to do on columns in DT1?
    |     ---------> using which columns as keys? 
     --------------> join to which data.table?
Joining Data with data.table in R

Grouping by columns with joins

Join and calculate by group in customers:

customers[shipping, on = .(name), 
          .(avg_age = mean(age)), by = .(gender)]
   gender  avg_age
1:      M 46.66667
2:      F 39.00000
Joining Data with data.table in R

Let's practice!

Joining Data with data.table in R

Preparing Video For Download...