Identify cases with multiple conditions in multiple columns in R

I have a dataframe with 20 students and I need to identify the students who attended stage 43 for two years or more.

aluno <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
etapa_2012 <- c(42, 43, 44, 43, 42, 43, 44, 45, 42, 43, 44, 45, 42, 43, 44, 44, 42, 43, 44, 45)
etapa_2013 <- c(43, 44, 45, 43, 43, 44, 45, 45, 43, 43, 45, 45, 43, 44, 45, 44, 43, 44, 45, 45)
etapa_2014 <- c(44, 45, 45, 43, 44, 45, 45, 45, 44, 43, 45, 45, 44, 45, 45, 45, 44, 45, 45, 45)
etapa_2015 <- c(45, 45, 45, 44, 45, 45, 45, 44, 43, 45, 45, 45, 45, 45, 45, 44, 43, 45, 45, NA)
fluxo <- data.frame(aluno, etapa_2012, etapa_2013, etapa_2014, etapa_2015)

But so far I have only managed to add a new column that flags the students who attended stage 43 in at least one year.

fluxo$dois_ou_mais <- ifelse(fluxo$etapa_2012 == 43 | fluxo$etapa_2013 == 43 | fluxo$etapa_2014 == 43 | fluxo$etapa_2015 == 43, 1, 0)
fluxo

The result marks every student who has stage 43 in any year.

I would like to arrive at a result where only students 4, 9, 10 and 17 are marked in the column dois_ou_mais, since they have stage 43 in more than one year.

2 answers

Start with the command

fluxo[, 2:5]==43

This tests every cell in columns 2 to 5 for equality with 43 and returns a logical matrix of TRUE and FALSE values (and NA wherever the stage is missing).

head(fluxo[, 2:5]==43)
     etapa_2012 etapa_2013 etapa_2014 etapa_2015
[1,]      FALSE       TRUE      FALSE      FALSE
[2,]       TRUE      FALSE      FALSE      FALSE
[3,]      FALSE      FALSE      FALSE      FALSE
[4,]       TRUE       TRUE       TRUE      FALSE
[5,]      FALSE       TRUE      FALSE      FALSE
[6,]       TRUE      FALSE      FALSE      FALSE

In R, TRUE counts as 1 and FALSE as 0 in arithmetic. So just sum each row to count its TRUE values:

apply(head(fluxo[, 2:5]==43), 1, sum)
[1] 1 1 0 3 1 1

To find out who attended more than once, regardless of exactly how many times, use the command below:

as.numeric(apply(head(fluxo[, 2:5]==43), 1, sum)>1)
[1] 0 0 0 1 0 0

Remove the head() calls from the commands above and the same solution works on the full data set.
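For reference, here is a minimal sketch of the full-size version. It rebuilds the data from the question and adds na.rm = TRUE, which is an addition on my part: student 20 has an NA in etapa_2015, and without it sum() would return NA for that row. The name anos_43 is made up for illustration.

```r
# Data from the question (aluno written as 1:20 for brevity)
aluno <- 1:20
etapa_2012 <- c(42, 43, 44, 43, 42, 43, 44, 45, 42, 43, 44, 45, 42, 43, 44, 44, 42, 43, 44, 45)
etapa_2013 <- c(43, 44, 45, 43, 43, 44, 45, 45, 43, 43, 45, 45, 43, 44, 45, 44, 43, 44, 45, 45)
etapa_2014 <- c(44, 45, 45, 43, 44, 45, 45, 45, 44, 43, 45, 45, 44, 45, 45, 45, 44, 45, 45, 45)
etapa_2015 <- c(45, 45, 45, 44, 45, 45, 45, 44, 43, 45, 45, 45, 45, 45, 45, 44, 43, 45, 45, NA)
fluxo <- data.frame(aluno, etapa_2012, etapa_2013, etapa_2014, etapa_2015)

# Number of years each student spent in stage 43.
# na.rm = TRUE is passed on to sum(), so a missing year counts as "not 43"
# instead of turning the whole row into NA.
anos_43 <- apply(fluxo[, 2:5] == 43, 1, sum, na.rm = TRUE)

# 1 if the student was in stage 43 in more than one year, 0 otherwise
fluxo$dois_ou_mais <- as.numeric(anos_43 > 1)

# Students flagged by the new column
fluxo$aluno[fluxo$dois_ou_mais == 1]
```

The last line should list students 4, 9, 10 and 17, which matches the result asked for in the question.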

  • By the way, why not use as.integer?

  • I may be mistaken, but I believe there is no practical difference between as.integer and as.numeric as long as the values are only used inside R. If I needed to pass the result of this command to a C program, it would matter whether the data were integer or floating point. Inside R, I guess it doesn't matter which function you use.
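A quick way to see what the two functions actually return. This is just an illustrative sketch, and the names flag_num and flag_int are made up:

```r
# Same logical input, converted both ways
flag_num <- as.numeric(c(TRUE, FALSE))
flag_int <- as.integer(c(TRUE, FALSE))

typeof(flag_num)               # "double"
typeof(flag_int)               # "integer"

# The values compare equal element by element...
all(flag_num == flag_int)      # TRUE

# ...but the objects are not identical, because the storage type differs
identical(flag_num, flag_int)  # FALSE
```

So for a 0/1 flag column either function gives the same numbers. The difference only shows up where the storage type itself is checked.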

Use the function rowSums to get the table as you requested:

fluxo$dois_ou_mais <- as.numeric(rowSums(fluxo[,-1] == 43, na.rm = TRUE) > 1)

But if you only want the list of students, without modifying the original table, I prefer to use tidyr and dplyr:

library(tidyr)
library(dplyr)
fluxo %>%
  gather(key = ano, value = etapa, -aluno) %>%
  filter(etapa == 43) %>%
  group_by(aluno) %>%
  summarise(N = n()) %>%
  filter(N > 1)
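In newer versions of tidyr, gather() is superseded by pivot_longer(). Below is a sketch of the same pipeline written that way, assuming tidyr 1.0.0 or later and a recent dplyr are installed. It rebuilds the question's data so it can be run on its own, and the name resultado is made up for illustration.

```r
library(tidyr)
library(dplyr)

# Data from the question (aluno written as 1:20 for brevity)
fluxo <- data.frame(
  aluno = 1:20,
  etapa_2012 = c(42, 43, 44, 43, 42, 43, 44, 45, 42, 43, 44, 45, 42, 43, 44, 44, 42, 43, 44, 45),
  etapa_2013 = c(43, 44, 45, 43, 43, 44, 45, 45, 43, 43, 45, 45, 43, 44, 45, 44, 43, 44, 45, 45),
  etapa_2014 = c(44, 45, 45, 43, 44, 45, 45, 45, 44, 43, 45, 45, 44, 45, 45, 45, 44, 45, 45, 45),
  etapa_2015 = c(45, 45, 45, 44, 45, 45, 45, 44, 43, 45, 45, 45, 45, 45, 45, 44, 43, 45, 45, NA)
)

resultado <- fluxo %>%
  # One row per student and year
  pivot_longer(cols = starts_with("etapa_"), names_to = "ano", values_to = "etapa") %>%
  # Keep only the years spent in stage 43 (rows with NA are dropped by filter)
  filter(etapa == 43) %>%
  # Count those years per student
  count(aluno, name = "N") %>%
  # Keep students with more than one such year
  filter(N > 1)

resultado
```

Selecting the columns with starts_with("etapa_") rather than -aluno means that a helper column such as dois_ou_mais, if it has already been added to fluxo, is not pulled into the reshaped data.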
