Direct (and elegant) solution to summarize a dataset using dplyr

4

I have the following dataset of defaulters:

df <- data.frame(
  lead_15 = c(1,0,0,0,0,1,0,0,1,0,0,0,0,0,1),
  lead_30 = c(0,0,0,1,0,0,1,1,0,1,0,0,0,1,0),
  lead_60 = c(0,1,0,0,1,0,0,0,0,0,1,1,0,0,0),

  inib_15 = c(1,0,0,0,0,0,0,0,1,0,0,0,0,0,0),
  inib_30 = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0),
  inib_60 = c(0,0,0,0,1,0,0,0,0,0,1,1,0,0,0),

  motivo_15 = c("A","","","","","","","","D","","","","","",""),
  motivo_30 = c("","","","B","","","A","","","","","","","B",""),
  motivo_60 = c("","","","","C","","","","","","B","D","","","")
)

I want a result with one row per lead (3 rows), with one column holding the sum of the respective lead, one column holding the sum of the respective inib, and one column for each motivo (A, B, C, D) holding the count of that reason.

LEAD    | QTD | INIB | A | B | C | D
--------|-----|------|---|---|---|---
lead_15 |  4  |  2   | 1 | 0 | 0 | 1
lead_30 |  5  |  3   | 1 | 2 | 0 | 0
lead_60 |  4  |  3   | 0 | 1 | 1 | 1

It’s a relatively simple problem that I can solve, but only with many separate pieces of code and calculations. I’m asking here because I suspect there is a direct solution using dplyr.

2 answers

3

I don’t know exactly what you mean by a direct solution, but here is one using dplyr and tidyr in a single (long) pipeline.

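library(dplyr)
library(tidyr)

df %>%
  # one 0/1 indicator column per reason, named e.g. motivo_15_A
  # (funs() is deprecated; a named list of formulas does the same job)
  mutate_at(vars(starts_with("motivo")),
            list(A = ~ if_else(. == "A", 1, 0),
                 B = ~ if_else(. == "B", 1, 0),
                 C = ~ if_else(. == "C", 1, 0),
                 D = ~ if_else(. == "D", 1, 0))) %>%
  select(-matches("motivo_\\d{2}$")) %>%
  # long format; motivo_15_A becomes motivoA_15 so the key splits in two
  gather() %>% mutate(key = gsub("(.+)(_)(\\d{2})_(.$)", "\\1\\4_\\3", key)) %>%
  separate(key, c("tipo", "grupo")) %>%
  group_by(tipo, grupo) %>% summarise(value = sum(value)) %>% spread(tipo, value)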

# A tibble: 3 x 7
  grupo  inib  lead motivoA motivoB motivoC motivoD
* <chr> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    15     2     4       1       0       0       1
2    30     3     5       1       2       0       0
3    60     3     4       0       1       1       1
  • I didn’t specify it well, but what I meant was a solution with dplyr functions that builds the table for all leads at once, not one at a time, full of calculations, ifs and so on, which is how I was doing it... All the more so because the dataset has many other variables besides these. And I didn’t know about tidyr; it contains many of the functions I was looking for.

2


I couldn’t think of a way to do it in just one expression. But I don’t think the following organization is bad.

library(dplyr)
library(tidyr)

motivo <- df %>%
  select(starts_with("motivo")) %>%
  gather(key, motivo) %>%
  separate(key, c('x', 'grupo')) %>%
  filter(motivo != "") %>%
  group_by(grupo, motivo) %>%
  summarise(n = n()) %>%
  spread(motivo, n, fill = 0)

inib <- df %>%
  select(starts_with("inib")) %>%
  gather(key, inib) %>%
  filter(inib != 0) %>%
  separate(key, c('x', 'grupo')) %>%
  group_by(grupo) %>%
  summarise(inib = sum(inib))

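qtd <- df %>%
  select(starts_with("lead")) %>%
  gather(key, lead) %>%
  separate(key, c('x', 'grupo')) %>%
  group_by(grupo) %>%
  # name the total qtd (not inib) so it does not clash with inib in the join
  summarise(qtd = sum(lead))

final <- left_join(qtd, inib, by = "grupo") %>% left_join(motivo, by = "grupo")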

Of course, if you know that the reasons will always be "A", "B", "C" and "D", @Fernando's solution is better. This solution assumes that the number of reasons can vary from one dataset to another, as can the number of "lead types".
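
For comparison, here is a minimal sketch of the same generic idea written with tidyr's newer pivot_longer() and pivot_wider(). It assumes tidyr 1.1.0 or later, which is not what the code above uses, and the names long, totals and reasons are only illustrative. The special ".value" entry in names_to turns the prefixes lead, inib and motivo into columns, so all three variables are reshaped in a single step.

```r
library(dplyr)
library(tidyr)

# Same data as in the question.
df <- data.frame(
  lead_15 = c(1,0,0,0,0,1,0,0,1,0,0,0,0,0,1),
  lead_30 = c(0,0,0,1,0,0,1,1,0,1,0,0,0,1,0),
  lead_60 = c(0,1,0,0,1,0,0,0,0,0,1,1,0,0,0),
  inib_15 = c(1,0,0,0,0,0,0,0,1,0,0,0,0,0,0),
  inib_30 = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0),
  inib_60 = c(0,0,0,0,1,0,0,0,0,0,1,1,0,0,0),
  motivo_15 = c("A","","","","","","","","D","","","","","",""),
  motivo_30 = c("","","","B","","","A","","","","","","","B",""),
  motivo_60 = c("","","","","C","","","","","","B","D","","","")
)

# One row per (original row, grupo), with columns grupo, lead, inib, motivo.
long <- df %>%
  pivot_longer(everything(),
               names_to = c(".value", "grupo"),
               names_sep = "_")

# Totals per group.
totals <- long %>%
  group_by(grupo) %>%
  summarise(qtd = sum(lead), inib = sum(inib))

# Reason counts per group: one column per reason found in the data,
# with 0 where a reason never occurs in a group.
reasons <- long %>%
  filter(motivo != "") %>%
  count(grupo, motivo) %>%
  pivot_wider(names_from = motivo, values_from = n,
              values_fill = 0, names_sort = TRUE)

final <- left_join(totals, reasons, by = "grupo")
final
```

As in the answer above, nothing here hard-codes the reasons or the lead types; the only assumption is that every column name has the form prefix_group.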

  • Excellent answer. I also always try to solve these in a 'generic' way, so I don’t have to write out the variables by hand, and the code stays the same if the dataset changes. Just one more question: since each row has only 1 lead (it is 15, 30 or 60), the 'motivo' variable could be a single column (without stating whether it is 15, 30 or 60). Would this answer become much more complex in that case? How would the motivo column be split into 3 columns?

  • @Thebiro I think nothing changes in the answer, as long as the column name starts with "motivo".
