Concatenating data.tables

Joining Data with data.table in R

Scott Ritchie

Postdoctoral Researcher in Systems Genomics

Same columns, different data.tables

Concatenating data.tables

Joining Data with data.table in R

Concatenation functions

rbind(): concatenate rows from data.tables stored in different variables

rbindlist(): concatenate rows from a list of data.tables

Joining Data with data.table in R

The rbind() function

Concatenate two or more data.tables stored as variables

# ... takes any number of arguments
rbind(...) 
rbind(sales_2015, sales_2016)
   quarter  amount
1:       1 3200100
2:       2 2950000
3:       3 2980700
4:       4 3420000
5:       1 3350000
6:       2 3000300
7:       3 3120200
8:       4 3670000
Joining Data with data.table in R

Adding an identifier column

The idcol argument adds a column indicating the data.table of origin

rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year")
   year quarter  amount
1: 2015       1 3200100
2: 2015       2 2950000
3: 2015       3 2980700
4: 2015       4 3420000
5: 2016       1 3350000
6: 2016       2 3000300
7: 2016       3 3120200
8: 2016       4 3670000
Joining Data with data.table in R

Adding an identifier column

rbind(sales_2015, sales_2016, idcol = "year")
   year quarter  amount
1:    1       1 3200100
2:    1       2 2950000
3:    1       3 2980700
4:    1       4 3420000
5:    2       1 3350000
6:    2       2 3000300
7:    2       3 3120200
8:    2       4 3670000
Joining Data with data.table in R

Adding an identifier column

rbind(sales_2015, sales_2016, idcol = TRUE)
   .id quarter  amount
1:   1       1 3200100
2:   1       2 2950000
3:   1       3 2980700
4:   1       4 3420000
5:   2       1 3350000
6:   2       2 3000300
7:   2       3 3120200
8:   2       4 3670000
Joining Data with data.table in R

Handling missing columns

rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", 
      fill = TRUE)

Joining Data with data.table in R

Handling missing columns

rbind(sales_2015, sales_2016, idcol = "year")
Error in rbindlist(l, use.names, fill, idcol) : 
  Item 2 has 3 columns, inconsistent with item 1 which has 2 columns. 
  If instead you need to fill missing columns, use set argument 'fill' 
  to TRUE.
Joining Data with data.table in R

The rbindlist() function

Concatenate rows from a list of data.tables

# Read in a list of data.tables
table_files <- c("sales_2015.csv", "sales_2016.csv")
list_of_tables <- lapply(table_files, fread)
rbindlist(list_of_tables)
   quarter  amount
1:       1 3200100
2:       2 2950000
3:       3 2980700
4:       4 3420000
5:       1 3350000
6:       2 3000300
7:       3 3120200
8:       4 3670000
Joining Data with data.table in R

Adding an identifier column

The idcol argument takes names from the input list

names(list_of_tables) <- c("2015", "2016")
rbindlist(list_of_tables, idcol = "year")
   year quarter  amount
1: 2015       1 3200100
2: 2015       2 2950000
3: 2015       3 2980700
4: 2015       4 3420000
5: 2016       1 3350000
6: 2016       2 3000300
7: 2016       3 3120200
8: 2016       4 3670000
Joining Data with data.table in R

Handling different column orders

rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", 
      use.names = TRUE)

Joining Data with data.table in R

`data.tables` with different column names

rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", 
      use.names = FALSE)

Joining Data with data.table in R

Pitfalls of `use.names = FALSE`

rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", 
      use.names = FALSE)

Joining Data with data.table in R

Differing defaults

  • Default for rbind() is use.names = TRUE
  • Default for rbindlist() is use.names = FALSE unless fill = TRUE.
Joining Data with data.table in R

Let's practice!

Joining Data with data.table in R

Preparing Video For Download...