Left_join returning dataframe with more lines than the original

Question

Left_join returning dataframe with more lines than the original

Asked 4 years, 1 month ago

Viewed 43 times

1

I have a date.frame with IBGE census data and a column called IND and I have another date.frame with a column IND and another column NEW_IND. The column NEW_IND contains the new codes for the industries per year, as can be seen below:

I’m wearing a left_join with my original basis, which contains 33,523 observations, but after using this function, the base gets 37,967. I would like to understand why the left_join adds so many lines and if there is a way not to add them.

Code I’m using:

teste <- left_join(sample, industry, by = c("YEAR", "IND"))

Man dput of the sample:

structure(list(YEAR = structure(c(1991L, 1991L, 1980L, 1970L, 
2010L, 2010L, 1991L, 1980L, 1970L, 1991L, 2000L, 1970L, 2010L, 
1991L, 1970L, 1970L, 1960L, 1970L, 2000L, 1980L, 1991L, 1970L, 
2010L, 2010L, 1991L, 1970L, 1960L, 1980L, 1970L, 1980L, 1991L, 
1991L, 2010L, 2000L, 1980L, 1980L, 1991L, 1970L, 1980L, 2000L, 
2000L, 1980L, 1991L, 1960L, 1970L, 1980L, 1960L, 1980L, 1970L, 
1970L), label = "Year", var_desc = "YEAR gives the year in which the census or survey was taken. For samples that span years, the midpoint or first year of the interval is reported."), 
    PERWT = structure(c(8.36, 9.41, 4, 3, 3.49, 9.69, 14.79, 
    4, 4, 4.17, 10.66, 4, 8.65, 4.46, 3, 4, 5, 4, 15.68, 4, 6.29, 
    4, 12.1, 7.97, 5.76, 4, 5, 4, 4, 4, 5.69, 8.24, 3.7, 9.32, 
    4, 5, 12.93, 4, 3, 7.48, 11.59, 5, 1.4, 5, 4, 3, 5, 3, 4, 
    3), label = "Person weight", var_desc = "PERWT indicates the number of persons in the actual population represented by the person in the sample.\n\nFor the samples that are truly weighted (see the comparability discussion), PERWT must be used to yield accurate statistics for the population.\n\nNOTE: PERWT has 2 implied decimal places.  That is, the last two digits of the eight-digit variable are decimal digits, but there is no actual decimal in the data."), 
    IND = c(110, 340, 110, 330, 110, 110, 110, 110, 516, 12, 
    53030, 927, 68000, 200, 122, 122, 326, 414, 55030, 724, 200, 
    110, 87000, 48042, 20, 110, 110, 611, 116, 631, 110, 150, 
    85012, 70002, 251, 20, 110, 110, 25, 36090, 110, 541, 110, 
    110, 341, 110, 110, 525, 110, 118)), row.names = c(NA, -50L
), class = c("tbl_df", "tbl", "data.frame"))

dput of my data.frame with industry code:

structure(list(YEAR = c(1970, 2010, 1980, 2010, 2010, 1980, 2010, 
2000, 1991, 1980, 1980, 2010, 1960, 2010, 1980, 2010, 1960, 1991, 
1960, 1991, 1970, 1960, 2010, 1960, 1980, 2010, 1991, 1970, 2010, 
1970, 2010, 2010, 1991, 1991, 2010, 1991, 2000, 2010, 2000, 1970, 
1980, 1960, 1980, 1960, 2010, 2000, 1960, 2010, 1980, 2010), 
    IND = c(119, 25002, 59, 8002, 81020, 462, 10092, 53111, 120, 
    613, 130, 45010, 859, 48071, 250, 81011, 129, 261, 117, 614, 
    303, 123, 86002, 616, 180, 45020, 416, 932, 33001, 0, 20999, 
    85999, 37, 623, 1401, 612, 60999, 2000, 93030, 330, 610, 
    129, 726, 622, 1116, 15041, 0, 7002, 419, 48042), NEW_IND = c(111, 
    311, 129, 251, 752, 452, 335, 513, 312, 722, 313, 129, 851, 
    418, 333, 513, 111, 335, 111, 756, 254, 111, 713, 616, 316, 
    515, 417, 129, 513, 999, 326, 711, 129, 753, 111, 716, 129, 
    111, 129, 334, 717, 129, 851, 622, 111, 335, 999, 251, 421, 
    413)), row.names = c(NA, -50L), class = "data.frame")

2 answers

2

It is hard to know exactly what is happening without seeing your complete data. But perhaps it is as follows: When you run left_join(A, B), all matching combinations between A and B are returned. In other words, new lines are added to your new data frame whenever there are multiple lines in B which correspond to the same line in A. For example:

library(dplyr)

A <- data.frame(col1 = letters[1:4],
                col2 = 1:4)
B <- data.frame(col1 = rep(letters[1:2], 2),
                col3 = 4:1)

left_join(A, B, by = "col1")

Its result has 6 lines, and A has only 4.

Thanks for the explanation!

– Alexandre Sanches

2021/06/09 at 14:38

Browser other questions tagged r dplyr tidyverse

You are not signed in. Login or sign up in order to post.

by Carlos Eduardo Lagosta • **5,497** points · Answer 1 · 2021-06-09T05:45:03+00:00

As indicated in reply from ALS.Meyer, If in some year a code has been debunked into two or more (something common in official classification systems), then the union will return more lines than the original. From your data sample, it is not possible to know if this is the case, but here is an example:

library(dplyr)

dados <- data.frame(perwt = 1:4, ind = 101:104)

codigos <- data.frame(ind = c(101, 101:104), new_ind = 200:204)

left_join(dados, codigos, "ind")
#>     perwt ind new_ind
#> 1     1 101     200
#> 2     1 101     201
#> 3     2 102     202
#> 4     3 103     203
#> 5     4 104     204

Note that the line corresponding to Ind == 101 has been duplicated. If you need the exact number of rows of the first data frame. and have no problem working with codes such as strings later, one option is to use nest_join to aggregate the duplicate values:

nest_join(dados, codigos, by = "ind", name = "new_ind")
#>   perwt ind  new_ind
#> 1     1 101 200, 201
#> 2     2 102      202
#> 3     3 103      203
#> 4     4 104      204

Or, using only R base:

cod.a <- aggregate(new_ind ~ ind, codigos, paste)

merge(cod.a, dados, by = "ind", all.y = TRUE)
#>   ind  new_ind perwt
#> 1 101 200, 201     1
#> 2 102      202     2
#> 3 103      203     3
#> 4 104      204     4