How to apply the `Join` (dplyr) function in a list?

I have tried many ways to apply the `join` functions from the dplyr package to a list, but I cannot get it to work. With `Reduce()` and `merge()` this is easy to do, but my intention is to do it with a join, since that is a faster function.

With Reduce and merge the code looks like this:

newdata <- Reduce(function(x,y) merge(x, y, all = FALSE, by = 'row.names', 
                                      incomparables = NA, sort = FALSE),
                  mylist)

The goal is to join the data.frames of `mylist` into a single data.frame.

`dput()` output of the list, to help with answers:

mylist=structure(list(period1 = structure(list(sell = 
c(954.82455776073, 
510.810676729307, 744.75243431516, 740.655287634581, 993.685934110545, 
866.834087180905, 523.575691389851, 764.874521177262, 
817.054593935609, 
518.385569332168, 531.494156224653, 843.328540329821, 
795.584754319862, 
938.056216109544, 710.636245436035, 997.307573445141, 
869.842965039425, 
771.45393146202, 823.295841924846, 601.494735921733), place = 
structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("a", "b"), class = "factor")), .Names = 
c("sell", 
"place"), row.names = c(NA, -20L), class = "data.frame"), period10 = 
structure(list(
sell = c(733.160433010198, 965.176168596372, 773.615662241355, 
871.220104396343, 850.882120081224, 548.022049595602, 
621.362034813501, 
814.990229788236, 549.271885422058, 695.817611529492, 
881.452074856497, 
993.035112507641, 771.279759705067, 992.838160018437, 
774.461645982228, 
517.854797886685, 637.189441244118, 623.788836062886, 
780.715740052983, 
712.462180759758), place = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor")), .Names = c("sell", 
"place"), row.names = c(NA, -20L), class = "data.frame"), period11 = 
structure(list(
sell = c(533.743410953321, 773.384398664348, 872.096808976494, 
842.97708561644, 666.135999024846, 693.793271319009, 863.513759453781, 
539.360933355056, 997.354788240045, 752.822959562764, 
671.030344441533, 
731.102725607343, 580.205574864522, 667.002705973573, 
630.402681184933, 
899.911671527661, 872.772023198195, 543.33548923023, 682.689820649102, 
778.205765294842), place = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor")), .Names = c("sell", 
"place"), row.names = c(NA, -20L), class = "data.frame"), period12 = 
structure(list(
sell = c(947.833278682083, 588.600017828867, 992.308593471535, 
763.031872571446, 809.284279122949, 798.120932886377, 
645.494438474998, 
773.96222949028, 880.193093093112, 681.19038187433, 962.226237053983, 
865.827421075664, 837.468956946395, 975.754468236119, 
643.590759718791, 
701.593282399699, 717.885117628612, 986.66607378982, 866.434537689202, 
718.550421763211), place = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor")), .Names = c("sell", 
"place"), row.names = c(NA, -20L), class = "data.frame")), .Names = 
c("period1", 
"period10", "period11", "period12"))

Consider `by = 'row.names'` in the function.

Also, although the data.frames in `mylist` here have the same number of rows, please consider that in general they have different numbers of rows. That is why I rely on `cbind`-style recycling, that is, the repetition of the initial elements of the shorter vector to match the vector with the larger number of rows.
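To illustrate the recycling behavior mentioned above, here is a minimal sketch (the column names `x` and `y` are just placeholders, not from the data above):

```r
# cbind() recycles the shorter vector until it matches the longer one
cbind(x = 1:6, y = 1:2)
#      x y
# [1,] 1 1
# [2,] 2 2
# [3,] 3 1
# [4,] 4 2
# [5,] 5 1
# [6,] 6 2
```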

  • You want to join with what? Is the join over the list as a whole, or over each data.frame in the list? If you only want to add columns side by side (which is what the use of by = "rownames" suggests), wouldn't cbind() be a more appropriate solution?

  • I edited the question. I beg your pardon if I was unclear.

  • For comparison purposes, could you share in the question the code you used with Reduce() and merge()?

  • I put the code in the question.

1 answer

In the tidyverse, the package used for handling lists is purrr. To solve the proposed problem we will have to use this package together with dplyr.

First, let me propose a code change to simplify our comparisons:

merge2 <- function(x, y) {
  merge(x, y, "row.names", sort = FALSE, incomparables = NA)
}
newdata2 <- Reduce(merge2, mylist)
identical(newdata, newdata2)
# [1] TRUE

In purrr we have the functions purrr::map(), similar to base::lapply(), and purrr::reduce(), similar to base::Reduce(). In the proposed solution, we will first create a column with the row.names in each data.frame, and then reduce them to a single data.frame.
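As a quick illustration of the parallel between the two pairs of functions (a minimal sketch, assuming purrr is installed):

```r
library(purrr)

# base::Reduce() and purrr::reduce() perform the same left fold
Reduce(`+`, list(1, 2, 3))    # [1] 6
reduce(list(1, 2, 3), `+`)    # [1] 6

# base::lapply() and purrr::map() both return a list, element by element
lapply(1:3, sqrt)
map(1:3, sqrt)
```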

library(tidyverse)

mylist %>% 
  map(~ mutate(.x, rownames = row.names(.x))) %>% 
  reduce(inner_join, by = "rownames") %>% 
  head()

    sell.x place.x rownames   sell.y place.y sell.x.x place.x.x sell.y.y place.y.y
1 954.8246       a        1 733.1604       a 533.7434         a 947.8333         a
2 510.8107       a        2 965.1762       a 773.3844         a 588.6000         a
3 744.7524       a        3 773.6157       a 872.0968         a 992.3086         a
4 740.6553       a        4 871.2201       a 842.9771         a 763.0319         a
5 993.6859       a        5 850.8821       a 666.1360         a 809.2843         a
6 866.8341       a        6 548.0220       a 693.7933         a 798.1209         a

Note: purrr allows you to use this formula notation (~) when defining a function. Just write the code and use .x where the argument iterated over by map() goes.
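A minimal sketch of the equivalence between the two notations (assuming purrr is loaded; the toy list is just for illustration):

```r
library(purrr)

# These two calls are equivalent: ~ sum(.x) is shorthand
# for function(.x) sum(.x)
map(list(1:3, 4:6), ~ sum(.x))
map(list(1:3, 4:6), function(x) sum(x))
# both return a list with the elements 6 and 15
```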

Since your goal is to speed things up, how does the proposed solution perform?

library(microbenchmark)

microbenchmark(
  base_reduce = Reduce(merge2, mylist),
  purrr_merge = purrr::reduce(mylist, merge2),
  purrr_join = mylist %>% 
    map(~ mutate(.x, row.names = row.names(.x))) %>% 
    reduce(inner_join, by = "row.names")
)

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval cld
  base_reduce 2.866133 2.952605 3.040672 2.997764 3.099633 3.458348   100 a  
  purrr_merge 2.887687 3.013930 3.263133 3.086803 3.213559 7.019844   100  b 
   purrr_join 4.994308 5.192140 5.539170 5.309402 5.516985 8.808804   100   c

Not very well, it seems. This happens because the data.frames used are very small and cannot properly take advantage of the strengths of dplyr and purrr.

Let’s see what happens when we have data of larger and different sizes.

criar_dados <- function(n) {
  tibble(sell = rnorm(n, mean = 800, sd = 100),
         place = rep(c("a", "b"), each = n / 2))
}

set.seed(1)

mylist2 <- list(
  period1 = criar_dados(40000),
  period10 = criar_dados(10000),
  period11 = criar_dados(5000),
  period12 = criar_dados(10000)
)

mylist3 <- mylist2 %>% 
  map(~ mutate(.x, row.names = row.names(.x)))

microbenchmark(
  base_reduce = Reduce(merge2, mylist2),
  purrr_merge = purrr::reduce(mylist2, merge2),
  purrr_join = mylist2 %>% 
    map(~ mutate(.x, row.names = row.names(.x))) %>% 
    reduce(left_join, by = "row.names"),
  sem_map = reduce(mylist3, left_join, by = "row.names")
) 

Unit: milliseconds
        expr      min       lq     mean   median       uq       max neval cld
 base_reduce 49.55766 51.29197 56.82020 52.28421 56.13617 206.06183   100   c
 purrr_merge 49.68185 51.34842 53.86253 52.62342 56.48257  64.07410   100   c
  purrr_join 42.35513 43.34814 45.68465 44.27135 47.79180  54.17221   100  b 
     sem_map 20.10655 20.58868 22.42495 21.14190 24.24563  32.57692   100 a  

See that with sets of 40 thousand and 10 thousand rows, the performance of the join-based solutions is already beginning to stand out.

It is worth noting that the comparison of the third case with the first two is not entirely fair, because it also pays the cost of creating a new column with the row names. It seems to me that this will not be necessary in actual use. If we can drop the map() step from the solution (that is, rely on an id already present in the data set to perform the join), as in the last case, the performance becomes much better.
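As a side note, the row-name column used in the map() step can also be created with tibble::rownames_to_column(). This is just an alternative sketch, assuming tibble is available (the two-row data.frame is a made-up example, and this variant was not part of the benchmark above):

```r
library(tibble)

df <- data.frame(sell = c(954.8, 510.8))
rownames_to_column(df, var = "row.names")
# adds a character column "row.names" (values "1", "2") before "sell"
```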
