How to join observations of tables that have a different set of variables in R?

5

In the case below, id identifies observations stored in two different data frames: in df1 the observations have the variable z, and in df2 they have x and z.

How can the two data frames be joined in cases like this? The rbind command doesn't work, because the data frames do not have the same columns.

df1 <- data.frame(id=1:10, z = rnorm(10))
df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))

dados <- rbind(df1,df2)

Ideally, the column not present in all data frames should be added, assigning the NA value to observations that do not have this variable.

4 answers

4


In this case I always use the bind_rows function from dplyr:

library(dplyr)
dados <- bind_rows(df1,df2)

> dados
Source: local data frame [20 x 3]

      id          z           x
   (int)      (dbl)       (dbl)
1      1  0.8179472          NA
2      2  0.2624969          NA
3      3 -0.1684590          NA
4      4 -0.1239140          NA
5      5  0.4434778          NA
6      6 -0.8865578          NA
7      7  0.1160360          NA
8      8  0.5604733          NA
9      9 -2.2761215          NA
10    10 -0.7920775          NA
11    11  1.7650167 -1.38172797
12    12 -1.0004357  2.64345620
13    13 -1.6467084 -0.01361806
14    14  0.9055755  2.00354819
15    15 -0.1645952  0.57657614
16    16  0.2675339 -0.01727064
17    17  0.6383209 -0.43920834
18    18 -1.4729775 -0.35907320
19    19  0.9345417 -0.93673279
20    20 -0.7888048  0.36903134
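
If you also need to know which data frame each row came from, bind_rows accepts a .id argument. A minimal sketch, assuming dplyr is installed; the column name "source" and the list names are illustrative choices, not part of the question:

```r
library(dplyr)

set.seed(1)  # only to make the example reproducible
df1 <- data.frame(id = 1:10, z = rnorm(10))
df2 <- data.frame(id = 11:20, x = rnorm(10), z = rnorm(10))

# Naming the list elements makes .id record the origin of each row
dados <- bind_rows(list(df1 = df1, df2 = df2), .id = "source")

# Rows coming from df1 have NA in x, since df1 has no x column
table(dados$source, is.na(dados$x))
```

This keeps the provenance of each observation without having to add a marker column to every data frame by hand.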

I thought it would be interesting to add a comparison of the running times of all the alternatives:

> microbenchmark(
+   base = merge(df1, df2, all = TRUE),
+   dplyr = dplyr::bind_rows(df1,df2),
+   data.table = data.table::rbindlist(list(df1,df2), fill = TRUE),
+   plyr = plyr::rbind.fill(df1,df2)  
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq        max neval
       base 1370.788 1578.6680 2138.9646 1852.2805 2296.0775   8607.060   100
      dplyr   64.768  111.1450  205.0742  126.2580  161.3900   4055.948   100
 data.table  173.051  239.8905 2860.8464  280.5705  352.7535 253411.277   100
       plyr  362.365  440.6795  597.4301  506.5200  622.8745   4323.416   100

Note that, for these small data frames, the dplyr solution is the fastest of all. Comparing medians, it is more than 10x faster than the base solution and about 2x faster than the data.table solution.

3

You can use the rbind.fill function from the plyr package to do what you want:

library(plyr)
df1 <- data.frame(id=1:10, z = rnorm(10))
df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))

dados <- rbind.fill(df1,df2)
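
rbind.fill also accepts a list of data frames as its first argument, which helps when there are more than two tables. A sketch, assuming plyr is installed; df3 and its column w are hypothetical, added only for illustration:

```r
library(plyr)

set.seed(1)  # only to make the example reproducible
df1 <- data.frame(id = 1:10, z = rnorm(10))
df2 <- data.frame(id = 11:20, x = rnorm(10), z = rnorm(10))
df3 <- data.frame(id = 21:25, w = letters[1:5])  # hypothetical third table

# Pass all tables at once as a list
tabelas <- list(df1, df2, df3)
dados <- rbind.fill(tabelas)

# Count, per column, how many rows were filled with NA
colSums(is.na(dados))
```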

3

You can also use the rbindlist function from the data.table package, with the option fill = TRUE:

library(data.table)
rbindlist(list(df1,df2), fill = TRUE)
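
Note that rbindlist returns a data.table rather than a plain data frame. A sketch of converting the result back, and of tagging rows by origin with idcol, assuming data.table is installed; the column name "origem" is an illustrative choice:

```r
library(data.table)

set.seed(1)  # only to make the example reproducible
df1 <- data.frame(id = 1:10, z = rnorm(10))
df2 <- data.frame(id = 11:20, x = rnorm(10), z = rnorm(10))

# idcol adds a column identifying which list element each row came from
dt <- rbindlist(list(df1 = df1, df2 = df2), fill = TRUE, idcol = "origem")

# Convert to a plain data.frame if the rest of the code expects one
dados <- as.data.frame(dt)
class(dados)
```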

1

A solution in base R itself is to use merge. Although it is usually used to match rows on key columns, it also works as an rbind with fill as long as you pass the right argument (all = TRUE). It is a good option, especially if this would be your only reason for loading another package (dplyr, data.table, plyr).

> df1 <- data.frame(id=1:10, z = rnorm(10))
> df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))
> merge(df1, df2, all = TRUE)
   id           z            x
1   1  0.55248779           NA
2   2  0.18582807           NA
3   3  0.15226974           NA
4   4 -1.01512686           NA
5   5 -0.13544547           NA
6   6  0.07923429           NA
7   7  0.13807574           NA
8   8  0.08524223           NA
9   9  1.28002174           NA
10 10  0.46765903           NA
11 11 -0.46550011  0.083700363
12 12  0.02691267  0.623578122
13 13  1.60678489 -1.043026698
14 14 -1.77340124  0.001289941
15 15  0.89638862  1.388973550
16 16  0.47108478  0.047770795
17 17 -0.08128863 -0.099346609
18 18  0.24915884 -1.423919181
19 19  0.75685907  1.822885566
20 20  0.48232639 -1.172792347
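
Keep in mind that merge is a join: with all = TRUE it matches rows on the columns the two data frames share (here id and z), so it only behaves like an rbind when no row of df1 matches a row of df2, and the result comes back sorted by those columns. If plain stacking is what you want without extra packages, the missing columns can be added by hand. A minimal sketch; rbind_fill_base is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: row-bind two data frames, filling missing columns with NA
rbind_fill_base <- function(a, b) {
  all_cols <- union(names(a), names(b))
  # Add every column that a data frame lacks, filled with NA
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  # Put the columns in the same order before binding
  rbind(a[all_cols], b[all_cols])
}

set.seed(1)  # only to make the example reproducible
df1 <- data.frame(id = 1:10, z = rnorm(10))
df2 <- data.frame(id = 11:20, x = rnorm(10), z = rnorm(10))

dados <- rbind_fill_base(df1, df2)
str(dados)
```

Unlike merge, this keeps the rows in their original order and never collapses two rows into one.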
