1
I have a date.frame with IBGE census data and a column called IND and I have another date.frame with a column IND and another column NEW_IND. The column NEW_IND contains the new codes for the industries per year, as can be seen below:
I’m wearing a left_join with my original basis, which contains 33,523 observations, but after using this function, the base gets 37,967. I would like to understand why the left_join adds so many lines and if there is a way not to add them.
Code I’m using:
teste <- left_join(sample, industry, by = c("YEAR", "IND"))
Man dput of the sample:
structure(list(YEAR = structure(c(1991L, 1991L, 1980L, 1970L,
2010L, 2010L, 1991L, 1980L, 1970L, 1991L, 2000L, 1970L, 2010L,
1991L, 1970L, 1970L, 1960L, 1970L, 2000L, 1980L, 1991L, 1970L,
2010L, 2010L, 1991L, 1970L, 1960L, 1980L, 1970L, 1980L, 1991L,
1991L, 2010L, 2000L, 1980L, 1980L, 1991L, 1970L, 1980L, 2000L,
2000L, 1980L, 1991L, 1960L, 1970L, 1980L, 1960L, 1980L, 1970L,
1970L), label = "Year", var_desc = "YEAR gives the year in which the census or survey was taken. For samples that span years, the midpoint or first year of the interval is reported."),
PERWT = structure(c(8.36, 9.41, 4, 3, 3.49, 9.69, 14.79,
4, 4, 4.17, 10.66, 4, 8.65, 4.46, 3, 4, 5, 4, 15.68, 4, 6.29,
4, 12.1, 7.97, 5.76, 4, 5, 4, 4, 4, 5.69, 8.24, 3.7, 9.32,
4, 5, 12.93, 4, 3, 7.48, 11.59, 5, 1.4, 5, 4, 3, 5, 3, 4,
3), label = "Person weight", var_desc = "PERWT indicates the number of persons in the actual population represented by the person in the sample.\n\nFor the samples that are truly weighted (see the comparability discussion), PERWT must be used to yield accurate statistics for the population.\n\nNOTE: PERWT has 2 implied decimal places. That is, the last two digits of the eight-digit variable are decimal digits, but there is no actual decimal in the data."),
IND = c(110, 340, 110, 330, 110, 110, 110, 110, 516, 12,
53030, 927, 68000, 200, 122, 122, 326, 414, 55030, 724, 200,
110, 87000, 48042, 20, 110, 110, 611, 116, 631, 110, 150,
85012, 70002, 251, 20, 110, 110, 25, 36090, 110, 541, 110,
110, 341, 110, 110, 525, 110, 118)), row.names = c(NA, -50L
), class = c("tbl_df", "tbl", "data.frame"))
dput of my data.frame with industry code:
structure(list(YEAR = c(1970, 2010, 1980, 2010, 2010, 1980, 2010,
2000, 1991, 1980, 1980, 2010, 1960, 2010, 1980, 2010, 1960, 1991,
1960, 1991, 1970, 1960, 2010, 1960, 1980, 2010, 1991, 1970, 2010,
1970, 2010, 2010, 1991, 1991, 2010, 1991, 2000, 2010, 2000, 1970,
1980, 1960, 1980, 1960, 2010, 2000, 1960, 2010, 1980, 2010),
IND = c(119, 25002, 59, 8002, 81020, 462, 10092, 53111, 120,
613, 130, 45010, 859, 48071, 250, 81011, 129, 261, 117, 614,
303, 123, 86002, 616, 180, 45020, 416, 932, 33001, 0, 20999,
85999, 37, 623, 1401, 612, 60999, 2000, 93030, 330, 610,
129, 726, 622, 1116, 15041, 0, 7002, 419, 48042), NEW_IND = c(111,
311, 129, 251, 752, 452, 335, 513, 312, 722, 313, 129, 851,
418, 333, 513, 111, 335, 111, 756, 254, 111, 713, 616, 316,
515, 417, 129, 513, 999, 326, 711, 129, 753, 111, 716, 129,
111, 129, 334, 717, 129, 851, 622, 111, 335, 999, 251, 421,
413)), row.names = c(NA, -50L), class = "data.frame")

Thanks for the explanation!
– Alexandre Sanches