Joining Data with data.table in R
Scott Ritchie
Postdoctoral Researcher in Systems Genomics
# Which bacteria could be found at both sites using any method?
site1_ecology[site2_ecology, on = .(genus)]
site1_ecology[site2_ecology, on = .(genus)]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, :
Join results in 12 rows; more than 10 = nrow(x)+nrow(i). Check for
duplicate key values in i each of which join to the same group in x over
and over again. If that's ok, try by=.EACHI to run j for each group to
avoid the large allocation. If you are sure you wish to proceed, rerun
with allow.cartesian=TRUE. Otherwise, please search for this error message
in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
allow.cartesian = TRUE
allows the join to proceed:
# data.table syntax
site1_ecology[site2_ecology, on = .(genus), allow.cartesian = TRUE]
# merge()
merge(site1_ecology, site2_ecology, by = "genus", allow.cartesian = TRUE)
site1_ecology[site2_ecology, on = .(genus), allow.cartesian = TRUE]
genus count method present i.method
1: Nitrosomonas 500 WGS TRUE WGS
2: Nitrosomonas 620 16S TRUE WGS
3: Nitrosomonas 500 WGS TRUE 16S
4: Nitrosomonas 620 16S TRUE 16S
5: Nitrosomonas 500 WGS TRUE Culture
6: Nitrosomonas 620 16S TRUE Culture
7: Rhizobium 360 WGS TRUE WGS
8: Rhizobium 300 16S TRUE WGS
9: Rhizobium 360 WGS TRUE 16S
10: Rhizobium 300 16S TRUE 16S
11: Rhizobium 360 WGS FALSE Culture
12: Rhizobium 300 16S FALSE Culture
Missing values (NA
) will match all other missing values:
!is.na()
can be used to filter rows with missing values
site1_ecology <- site1_ecology[!is.na(genus)]
site1_ecology
genus count method
1: Nitrosomonas 500 WGS
2: Rhizobium 360 WGS
site2_ecology <- site2_ecology[!is.na(genus)]
site2_ecology
genus present method
1: Nitrosomonas TRUE Culture
2: Rhizobium TRUE Culture
3: Azotobacter TRUE Culture
site1_ecology[site2_ecology, on = .(genus), mult = "first"]
children[parents, on = .(parent = name), mult = "last"]
duplicated()
: what rows are duplicates?
unique()
: filter a data.table
to just unique rows
Using values in all columns:
duplicated(site1_ecology)
FALSE FALSE FALSE FALSE
Using values in a subset of columns:
duplicated(site1_ecology,
by = "genus")
FALSE TRUE FALSE TRUE
unique(site1_ecology, by = "genus")
fromLast = TRUE
changes the direction of the search to start from the last row
duplicated(site1_ecology, by = "genus", fromLast = TRUE)
TRUE FALSE TRUE FALSE
unique(site1_ecology, by = "genus", fromLast = TRUE)
Joining Data with data.table in R