r - Full outer join of multiple dataframes stored as elements of a list using data.table

Wednesday, 21 February 2018

r - Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed up the merge by using data.table. I'm not sure the best way for data.table to handle the list structure with multiple dataframes. I'm also not sure if I've written the Reduce() function correctly on unique keys to do a full outer join on multiple dataframes.

Here's a small example:

#Libraries
library("reshape")
library("data.table")

#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))

#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))

#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here

Reduce(function(x, y) x[y[J(unique_keys)]], filelist)

Here's my expected output:

> listMerged
   x y  a  b  c
1  1 1  1  0  0
2  1 2  2  5 10
3  1 3  3 10 20
4  2 1  4 15 30
5  2 2  5 20 40
6  2 3  6 25 50
7  3 1  7 30 60
8  3 2  8 35 70
9  3 3  9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80

Here are my resources:

Answer

This worked for me:

library("reshape")
library("data.table")
##
filelist <- list(
  data.frame(
    x=c(1,1,1,2,2,2,3,3,3), 
    y=c(1,2,3,1,2,3,1,2,3), 
    a=1:9),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4), 
    y=c(1,2,3,1,2,3,1,2,1), 
    b=seq(from=0, by=5, length.out=9)),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4), 
    y=c(1,2,3,1,2,3,1,2,2), 
    c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList,function(x){
  setkeyv(x,cols=c("x","y"))
})
##
> Reduce(function(x,y){
  merge(x,y,all=T,allow.cartesian=T)
},dtList)
    x y  a  b  c
 1: 1 1  1  0  0
 2: 1 2  2  5 10
 3: 1 3  3 10 20
 4: 2 1  4 15 30
 5: 2 2  5 20 40
 6: 2 3  6 25 50
 7: 3 1  7 30 60
 8: 3 2  8 35 70
 9: 3 3  9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80

Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in

> dt
       filelist
1: 
2: 
3:

which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?

Reduce(function(x, y) x[y[J(unique_keys)]], filelist)

I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)

Error in eval(expr, envir, enclos) : could not find function "J"

which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by @Arun in this answer.

Blog

Wednesday, 21 February 2018