Joining Data with data.table in R
Scott Ritchie
Postdoctoral Researcher in Systems Genomics
data.table expressions can be chained in sequence:
demographics[...][...]
General form of chaining a join:
DT1[DT2, on][i, j, by]
| | | | |
| | | | --> grouped by what?
| | | -----> what to do?
| | --------> on which rows?
| ------------> join key columns
-----------------> join to which data.table?
customers <- data.table(name = c("Mark", "Matt", "Angela", "Michelle"),
gender = c("M", "M", "F", "F"),
age = c(54, 43, 39, 63))
customers
name gender age
1: Mark M 54
2: Matt M 43
3: Angela F 39
4: Michelle F 63
purchases <- data.table(name = c("Mark", "Matt", "Angela", "Michelle"),
sales = c(1, 5, 4, 3),
spent = c(41.70, 41.78, 50.77, 60.01))
purchases
name sales spent
1: Mark 1 41.70
2: Matt 5 41.78
3: Angela 4 50.77
4: Michelle 3 60.01
customers[purchases,
on = .(name)][sales > 1,
j = .(avg_spent = sum(spent) / sum(sales)),
by = .(gender)]
gender avg_spent
1: M 13.91333
2: F 20.00333
Computation with joins:
DT1[DT2, on, j]
| | |
| | ----> what to do on the join result?
| --------> using which columns as keys?
-------------> join to which data.table?
Efficient for large data.tables
!
Column creation takes place in the main data.table
:
customers[purchases, on = .(name), return_customer := sales > 1]
customers
name gender age return_customer
1: Mark M 54 FALSE
2: Matt M 43 TRUE
3: Angela F 39 TRUE
4: Michelle F 63 TRUE
by = .EACHI
groups j
by each row from DT2
DT1[DT2, on, j, by = .EACHI]
| | | |
| | | --> grouped by each match in DT1.
| | -----> what to do on the join result?
| ---------> using which columns as keys?
--------------> join to which data.table?
shipping[customers, on = .(name),
j = .("# of shipping addresses" = .N),
by = .EACHI]
Grouping by columns in the by
restricts computation to the main data.table:
DT1[DT2, on, j, by]
| | | |
| | | --> grouped by what columns in DT1?
| | -----> what to do on columns in DT1?
| ---------> using which columns as keys?
--------------> join to which data.table?
Join and calculate by group in customers
:
customers[shipping, on = .(name),
.(avg_age = mean(age)), by = .(gender)]
gender avg_age
1: M 46.66667
2: F 39.00000
Joining Data with data.table in R