Comparing contents of a Column

4

I’m a beginner in R and need help comparing the contents of a column.

First I sorted my table by a specific column. For this I used the following code:

 library(data.table)  # provides fread()
 x <- fread("x.txt", sep = ";")
 x_ordenado <- x[order(x$V3), ]  # sort the rows by column V3

I’m working with files that have around 5 million lines, and I need to reduce this number. One way would be to remove the rows whose value matches an item in a list of 10,450 items. That is, among those 5 million lines, one column holds values that are sometimes in this list and sometimes not.

Any idea what I can do?

Thank you.

2 answers

2


You can do this in more than one way in R. The simplest is to use the %in% operator to check which of your values are not in the list of values you want to remove. For example:

> todos <- 1:10 # Your data, numbers from 1 to 10
> excluir <- c(2,3,5,7) # Values to be removed
> todos[!todos %in% excluir] # Subsets the values not contained in excluir
[1]  1  4  6  8  9 10
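
To make the mechanics explicit, here is a small sketch in base R that reuses the same todos and excluir vectors. The name dentro is introduced only for illustration. It shows that %in% returns one logical value per element, that it gives the same result as the match() form described in R’s help page for match, and that no extra parentheses are needed around the negated test.

```r
todos   <- 1:10
excluir <- c(2, 3, 5, 7)

# One TRUE/FALSE per element of todos: is it present in excluir?
dentro <- todos %in% excluir

# Same result written with match(), which %in% is built on
identical(dentro, match(todos, excluir, nomatch = 0) > 0)

# %in% binds tighter than unary !, so both spellings agree
identical(!todos %in% excluir, !(todos %in% excluir))

# Keep only the elements that are not in excluir
todos[!dentro]
```

Storing the logical vector first can be handy when the same mask is needed to filter several objects of the same length.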

This approach does not seem heavy to me even for this amount of data, but another alternative would be to use filter() from dplyr, which would look like this:

> library(dplyr)
> df <- data.frame(todos) # Convert to a data frame
> df %>% filter(! todos %in% excluir)
  todos
1     1
2     4
3     6
4     8
5     9
6    10

If you’re chaining other commands, dplyr may be a good alternative; otherwise there is no need to load the package just for this.

This would remove your unwanted values, but I don’t think it would noticeably improve data handling: if each of the 10,450 values matched a single line, you would remove only about 0.2% of the lines. It may be more effective to optimize the steps of the code that are actually slow, rather than to reduce the size of the data.
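
To get a feel for the cost, here is a rough sketch on synthetic data. The sizes mirror the question, but the key values, their range, and the names n_linhas, valores, manter and tempo are assumptions made up for illustration, not the asker’s data.

```r
set.seed(1)                                   # reproducible synthetic data
n_linhas <- 5e6                               # roughly the size mentioned in the question
valores  <- sample.int(1e6, n_linhas, replace = TRUE)  # hypothetical key column
excluir  <- sample.int(1e6, 10450)            # hypothetical list of values to drop

tempo <- system.time(
  manter <- !valores %in% excluir             # TRUE for values that are not in the list
)
print(tempo)                                  # elapsed time depends on the machine

mean(!manter)                                 # fraction of lines that would be removed
```

In this synthetic setup roughly 1% of the lines are dropped, because each listed value repeats several times; the share removed from the real file depends on how often the listed values occur in the column.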

  • Thank you for your answer! I did two tests with the first possibility, but for some reason it didn’t work. Try 1: I read my file, which is a table, into a variable. Then I read the list of values I want to keep into another variable and ran the command all[!all %in% delete], but it didn’t work. Try 2: I then separated the file into columns, that is, I compared column by column, but it didn’t work either. The second possibility results in the following error: Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "logical"

  • Can you include an example of your data in the question? Ideally the output of dput(head(x_ordenado)), and the same for the list of values you want to delete. As for the second attempt, you need to install the dplyr package: install.packages("dplyr").

1

Creating an example data.frame:

dados <- data.frame(x = rnorm(30), y = c("a","b","c"))

To delete lines you perform a logical set operation: you select the elements that are not in a given set.

Let’s create the vector with the categories of y that you want to remove:

excluir <- c("a", "b")

Now we can select only the lines in which y is not in the vector excluir (the ! operator negates the condition):

dados[!dados$y %in% excluir, ]
           x y
3   0.1003638 c
6   1.4888718 c
9   0.3561347 c
12 -0.4532080 c
15  0.3552320 c
18  0.6220573 c
21 -1.0136110 c
24 -0.4445456 c
27 -0.6974983 c
30  1.0516000 c

Since you say your dataset can be large, besides the dplyr that Molx mentioned, another interesting package is data.table. With data.table it would be as follows:

library(data.table)
dados <- data.table(dados)
dados[! y %in% excluir,]
             x y
 1:  0.1003638 c
 2:  1.4888718 c
 3:  0.3561347 c
 4: -0.4532080 c
 5:  0.3552320 c
 6:  0.6220573 c
 7: -1.0136110 c
 8: -0.4445456 c
 9: -0.6974983 c
10:  1.0516000 c
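
Putting the pieces together for the situation in the question, here is a minimal end-to-end sketch in base R. The file contents and the names arq_dados, arq_lista and x_filtrado are hypothetical, and it assumes the list of values is stored one per line in its own file. read.table() is used so the sketch needs no extra package; fread() from data.table could be substituted for the large file.

```r
# Two small temporary files standing in for the real ones (hypothetical contents)
arq_dados <- tempfile(fileext = ".txt")
arq_lista <- tempfile(fileext = ".txt")
writeLines(c("a;10;k3", "b;20;k1", "c;30;k2", "d;40;k1"), arq_dados)
writeLines("k1", arq_lista)

# Without a header, read.table() names the columns V1, V2, V3
x <- read.table(arq_dados, sep = ";", stringsAsFactors = FALSE)
excluir <- readLines(arq_lista)               # one value per line

# Drop the rows whose V3 appears in the list, then sort by V3
x_filtrado <- x[!x$V3 %in% excluir, ]
x_ordenado <- x_filtrado[order(x_filtrado$V3), ]
print(x_ordenado)
```

Filtering before sorting means order() only has to work on the rows that are kept.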
