Speed of joining tables in R

3

Good evening!

I join two tables in RStudio using merge(). I would like to know whether another join method (e.g. left_join) would be faster, because my tables reach 8 million rows.

Thank you.

1 answer

4


Ronaldo, how are you?

Check out this experiment comparing the merge() function with the inner_join() function from the dplyr package.

# Make the random results reproducible
set.seed(101)

# Generate two example datasets with 8,000,000 observations each
df1 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))

df2 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))


# Time the merge() function
system.time(dfa <- merge(df1, df2, by = c("x", "y")))

#    user  system elapsed 
# 115.911   2.563 122.016 


# Time the inner_join() function
library(dplyr)
system.time(dfb <- inner_join(df1, df2, by = c("x", "y")))

#    user  system elapsed 
#  16.459   0.966  17.833

Note that on my machine merge() took 122 seconds to complete the operation, while inner_join() took only about 18 seconds, roughly 7 times faster.
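The question mentions left_join, while the timing above uses inner_join. As a rough sketch, assuming small toy tables (the sizes and the column names z1 and z2 are made up for illustration, not taken from the question), the base R counterpart of a left join is merge() with all.x = TRUE, and the two results can be compared like this:

```r
library(dplyr)

set.seed(101)

# Toy tables (hypothetical sizes and column names, not the asker's data).
# Keys are drawn without replacement, so each x appears at most once per table.
df1 <- data.frame(x = sample(2000, 1000), z1 = rnorm(1000))
df2 <- data.frame(x = sample(2000, 1000), z2 = rnorm(1000))

# Base R left join: all.x = TRUE keeps every row of df1
left_base <- merge(df1, df2, by = "x", all.x = TRUE)

# dplyr left join
left_dplyr <- left_join(df1, df2, by = "x")

# merge() sorts by the key by default, while left_join() keeps the row
# order of df1, so sort the dplyr result before comparing
left_dplyr <- left_dplyr[order(left_dplyr$x), ]
rownames(left_dplyr) <- NULL

# Every row of df1 should survive a left join
nrow(left_base) == nrow(df1)

# Compare the two results
all.equal(left_base, left_dplyr)
```

Once both forms return the same rows on a small sample, each can be wrapped in system.time() on the full tables, as in the experiment above, to see which is faster on your machine.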

  • 1

    sample(16000000, 8000000) is simpler and gives the same numbers. The only difference is that your version produces vectors of class numeric (61 MB), while this one produces class integer (30.5 MB). Try it, and compare the two with identical() and all.equal().

  • @Noisy thanks for the tip.

  • 1

    Thanks Antonio, I’ll change my code and put the result here.
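The sample() tip from the comments can be checked with a sketch like the one below. The sizes are scaled down (16,000 and 8,000 instead of 16,000,000 and 8,000,000) purely so it runs quickly, and the variable names are illustrative only.

```r
n <- 16000   # scaled-down population size (assumption, for a quick run)
k <- 8000    # scaled-down sample size (assumption, for a quick run)

# Form used in the answer: sampling from a double vector
set.seed(101)
a <- sample(seq(1, n, 1), k)

# Form suggested in the comment: sampling from 1:n directly
set.seed(101)
b <- sample(n, k)

# Storage types differ
class(a)
class(b)

# identical() is strict about type, all.equal() compares the values
identical(a, b)
all.equal(a, b)

# Memory used by each vector
object.size(a)
object.size(b)
```

identical() also compares the storage type, so it tells the double and integer vectors apart, while all.equal() only checks that the values match within a tolerance.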
