How to apply filters to a data.frame?

Asked

Viewed 5,243 times

4

I have a database in Excel with 12135 data and not all fields are filled and some individuals have negative age and others are older than 17 years and others older than 70 years.

I need to know which functions/filters of SOFTWARE R to resolve these two issues:

1° Eliminate all those who have some unfilled data (e.g.: I have table of age, sex, organic unit, course, year of entry.Within the sex table, the sex of some students is not known; and others do not know the course that attends so on. I need to delete all data related to these students and work only with students who have all fields filled);

2° Present all information of students aged 17 to 70 years, that is, the others are deleted.

  • Friend, I do not know the R language, but I have a great knowledge in Python! I believe that R, from what I have heard in lectures, is not very effective in your case for having too many lines to process and would take longer than in Python. I recommend the Panda library in Python to read and make your dataset.

  • Welcome to Sopt! Read on How to ask and mcve to know how to elaborate a question of easy understanding and so can be helped. If you put the code you already have (data frame, columns, etc.) it will be much easier for the community to help you.

  • 1

    @Viniciusmesel, R can easily handle database with ~12k items; I’ve worked with data frames of over 1 million data smoothly.

  • 1

    Evandro, even if this can be easily done in R, you can also do it without difficulties in Excel using the filter tool. If your goal is to use R only to perform this filter, I think there is no need.

  • Evandro a suggestion, even if in general you have been clear, whenever possible provide a representative sample of these data, that facilitates the understanding and life that will answer the question. Also, if the case is just to filter the data, I agree with Molx, the Excel of this account.

  • Good afternoon. I don’t know how to post this example here. It is pa a requested job at school and one of the teacher’s requirements is that it be done in the R. Show me how to post an example in the OS

Show 1 more comment

1 answer

1

Considering:

dados<-data.frame(idade=c(15,18,25,40,85,NA),
              sexo=c("M","F",NA,"F","M","M"),
              unidade.organica=c("EMEI CG","USP",NA,"UFSM","UFRGS","UPF"),
              curso=c("TÉCNICO","SUPERIOR",NA,"SUPERIOR","SUPERIOR",NA),
              ano.ingresso=c(2005,2011,NA,2014,1980,2015))
#exibindo o data.frame criado
dados
  idade sexo unidade.organica    curso ano.ingresso
1    15    M          EMEI CG  TÉCNICO         2005
2    18    F              USP SUPERIOR         2011
3    25 <NA>             <NA>     <NA>           NA
4    40    F             UFSM SUPERIOR         2014
5    85    M            UFRGS SUPERIOR         1980
6    NA    M              UPF     <NA>         2015

NOTE: Take into account that your missing data is represented by NA.

##Filtro dos dados ausentes (NA):
#Removendo os NA's com função na.omit()
dada.sem.NA<-na.omit(dados)

#Removendo NA's com a função indexadora which():
dados.sem.NA<-dados[-unique(which(is.na(dados),arr.ind = T)[,1]),]

For both functions:which() or na.omit(). The result is:

 dados.sem.NA    
      idade sexo unidade.organica    curso ano.ingresso
    1    15    M          EMEI CG  TÉCNICO         2005
    2    18    F              USP SUPERIOR         2011
    4    40    F             UFSM SUPERIOR         2014
    5    85    M            UFRGS SUPERIOR         1980 

The age filter can be applied in any variable dados or in dados.sem.NA, see the cases:

#Filtro de idade na variável dados:
dados.por.idade<-dados[(dados.sem.NA$idade>17 & dados.sem.NA$idade<70), ]

The result is:

dados.por.idade
     idade sexo unidade.organica    curso ano.ingresso
   2    18    F              USP SUPERIOR         2011
   3    25 <NA>             <NA>     <NA>           NA
   6    NA    M              UPF     <NA>         2015

 #Filtro de idade na variável dados.sem.NA:
 dados.por.idade<-dados.sem.NA[(dados.sem.NA$idade>17 & dados.sem.NA$idade<70), ]

The result is:

dados.por.idade
    idade sexo unidade.organica    curso ano.ingresso
  2    18    F              USP SUPERIOR         2011
  4    40    F             UFSM SUPERIOR         2014

I hope I helped. Good luck!

  • 1

    Jean, I think you can simplify the code a little bit. To omit the NA's you can use na.omit(dados). To filter by age, there is no need of which, just put the two conditions.

  • Good evening, Jean. Thanks in advance, because your help has solved me a part of the problem. In this example: I want an is.na that eliminates all NA fields in qq column or row. Your code only eliminates the age line and I need a q also eliminates the string d letter, word or phrase. The age filter is giving error, pd it prints the meals variavesi and not the requested ages

  • It only eliminates AN by being n° and I need a code that eliminates the letters, words or phrases from the other columns. In the age filter this code prints the variables sex, age, organic unit and need a code q shows all students who are aged from 17 to 70 years.

  • @Molx, I’ll see how it gets better. -Evandro Lopes, as in your question you did not make explicit what/if one operation depended on the other, I assumed that the second one would be based on the results of the first one, I will try to make this more flexible.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.