How to verify if an ID has different information in a period of time?

Asked

Viewed 25 times

0

I have the following database: cnpj_root, affiliate (1 = matrix, 2 = affiliate), Uf, active_em_2017, active_em_2018, active_em_2019, active_em_2020 (NA= inactive, 1 = active).

What I need: identify, by Uf and year, the quantity in companies whose headquarters is in another state.

Former simple:

cnpj | Matriz_filial | Uf | 2017 | 2018 | 2019 | 2020
123       1            SP    1       1     1       1
123       2            BA    0       0     1       1 
123       2            BA    0       0     1       1 
123       2            SP    1       1     0       0 
456       1            SP    0       1     1       1
456       2            BA    0       0     1       0

BA = 2 in 2019 and 1 in 2020 (the matrix is in SP) SP = 0 (head office and subsidiary in the same state)

Expected result according to the sample basis:

BA = 1 in 2017, 1 in 2018, 1 in 2019, 0 in 2020 (has only subsidiaries in this state, the matrix is in another)

SP = 1 in all years (has only subsidiaries in that state, the head office is in another)

GO = 1 in all years (has only subsidiaries in that state, the parent company is in another)

All other = 0 in all years (headquarters and subsidiaries in the same state)

Then I need to plot a graph: y = Quantity, x = Year, group = Uf.

I tried to create a flag column to identify, but it’s not working.


empresa %>%
  tidyr::pivot_longer(
    cols = -c(cnpj_raiz, identificador_matriz_filial, uf),
    names_to = 'Ano'
  ) %>% 
  dplyr::select(cnpj_raiz, identificador_matriz_filial, uf, everything()) %>%
  dplyr::group_by(cnpj_raiz, identificador_matriz_filial, uf) %>% 
  dplyr::mutate(
    inclui_matriz_na_uf = all(1:2 %in% c(identificador_matriz_filial)),
    inclui_matriz_na_uf = ifelse(inclui_matriz_na_uf == T, 'Sim', 'Não')
    ) %>% 
  dplyr::mutate(Ano = stringr::str_extract(Ano, "[0-9].*")) %>% 
  dplyr::filter(inclui_matriz_na_uf == 'Não') 

Example of the basis:

structure(list(cnpj_raiz = c("6565656", "6565656", "77777","63547", "63547", "63547", "63547", "63547", "63547", "202102", "789654", "632145", "96896", "5555469", "5555469", 
"5555469", "6665554", "123321", "2703001", "270307171", "2220258", "878787", "878787", "234209", "717400", "737410", "98598","99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", 
"99992225", "99992225", "2252221", "96665123", "917800", "8920071", "9461800", "100553", "5571114", "7038010", "494001"), identificador_matriz_filial = c(1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,  2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), uf = c("PB", "PB", "RS", "PE", "PE", "PE", "PE", "PE", "PE", "SE", "BA", "AL", "BA", "PR", "PR", "PR", "RJ", "SP", "CE", "RJ",   "SP", "BA", "SP","SP", "SP", "MT", "PE", "GO", "GO", "GO", "GO",  "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "MA",  "GO", "PE", "PE", "MG", "SP", "PI", "SP", 
"RJ"), ativa_em_2017 = c(1L,  1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA,  NA, 1L, NA, 1L, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA,  1L), ativa_em_2018 = c(1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA,  1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA,  NA, 1L, NA, NA, NA, 1L, NA, NA), ativa_em_2019 = c(1L, 1L, NA,
1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L,  NA, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA), ativa_em_2020 = c(1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, 1, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA)), row.names = c(NA, 50L), class = "data.frame")

1 answer

1


I tried with 2 aggregations. The first with cnpj and Uf to create the flag column for pairs without affiliate. And in the second I summed up the values of flags by Uf and year, which are the information you want to plot.

empresa <- structure(list(cnpj_raiz = c("6565656", "6565656", "77777","63547", "63547", "63547", "63547", "63547", "63547", "202102", "789654", "632145", "96896", "5555469", "5555469", 
                                      "5555469", "6665554", "123321", "2703001", "270307171", "2220258", "878787", "878787", "234209", "717400", "737410", "98598","99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", "99992225", 
                                      "99992225", "99992225", "2252221", "96665123", "917800", "8920071", "9461800", "100553", "5571114", "7038010", "494001"), identificador_matriz_filial = c(1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,  2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), uf = c("PB", "PB", "RS", "PE", "PE", "PE", "PE", "PE", "PE", "SE", "BA", "AL", "BA", "PR", "PR", "PR", "RJ", "SP", "CE", "RJ",   "SP", "BA", "SP","SP", "SP", "MT", "PE", "GO", "GO", "GO", "GO",  "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "MA",  "GO", "PE", "PE", "MG", "SP", "PI", "SP", 
                                                                                                                                                                                                                                                                                                                                                                                                                "RJ"), ativa_em_2017 = c(1L,  1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA,  NA, 1L, NA, 1L, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA,  1L), ativa_em_2018 = c(1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA,  1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA,  NA, 1L, NA, NA, NA, 1L, NA, NA), ativa_em_2019 = c(1L, 1L, NA,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 1, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L,  NA, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA), ativa_em_2020 = c(1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, 1, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA)), row.names = c(NA, 50L), class = "data.frame")
library(magrittr)
empresa %>%
  tidyr::pivot_longer(
    cols = -c(cnpj_raiz, identificador_matriz_filial, uf),
    names_to = 'Ano'
  ) %>% 
  tidyr::drop_na(value) %>%  #Remove os anos não ativos
  dplyr::select(cnpj_raiz, identificador_matriz_filial, uf, everything()) %>%
  dplyr::group_by(cnpj_raiz, uf) %>% 
  dplyr::mutate(
    uf_sem_matriz = ifelse(any(identificador_matriz_filial == 1, na.rm = TRUE),
                                 0, 1)
    ) %>% 
  dplyr::group_by(Ano, uf) %>% 
  dplyr::summarize(
    qtd_filiais = sum(uf_sem_matriz, na.rm=TRUE),
    .groups = "drop"
  ) %>% 
  dplyr::mutate(Ano = stringr::str_extract(Ano, "[0-9].*")) %>% 
  dplyr::filter(qtd_filiais != 0)
#> # A tibble: 11 x 3
#>    Ano   uf    qtd_filiais
#>    <chr> <chr>       <dbl>
#>  1 2017  BA              1
#>  2 2017  GO              7
#>  3 2017  SP              1
#>  4 2018  BA              1
#>  5 2018  GO              8
#>  6 2018  SP              1
#>  7 2019  BA              1
#>  8 2019  GO              8
#>  9 2019  SP              1
#> 10 2020  GO              8
#> 11 2020  SP              1

Created on 2021-05-20 by the reprex package (v2.0.0)

  • Thank you! That’s almost it. The problem is this: all subsidiaries of the same company must count as 1 in the same state/year. Example: in GO despite having 8 subsidiaries, counts only as 1 company.

  • 1

    Oops, I solved it! I added one unique() between the group_by() and the summarize(). Thanks!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.