Count equal values in one data frame and store in another in R

5

I need to count the equal values in a column of one data frame (the one named total_amostral) and store the counts in a column of another data frame (the one named unicos) that contains the unique values of the first. The two data frames therefore have different sizes: the first is larger than the second. For this I used the code below, but R shows an error saying that the sizes of the data frames are different.

y <- c(1,2,3,4,5,6,7,8)
espaco_amostral<-data.frame(t(combn(y,m=4))) # Data frame with the combinations of vector y.
total_amostral<-data.frame(TOTAL=apply(espaco_amostral,1,sum)) # Data frame with the row sums of espaco_amostral.

unicos<-data.frame(unique(total_amostral)) # Data frame with the unique values of total_amostral
contador<-0
for(i in unicos){
   for(e in total_amostral){
      ifelse(total_amostral[e,] == unicos[i,],
             unicos[,2]<-contador+1,unicos[,2]<-0)
   }
}
unicos

I believe that total_amostral[e,] == unicos[i,] is comparing the entire data frames. How can I make it compare each element of total_amostral with each element of unicos and then count the matches?

4 answers

8


I wouldn’t try to reinvent the wheel; I would use a ready-made function in R to do this.

library(dplyr)

total_amostral %>%
  group_by(TOTAL) %>%
  count()
# A tibble: 17 x 2
# Groups:   TOTAL [17]
   TOTAL     n
   <dbl> <int>
 1    10     1
 2    11     1
 3    12     2
 4    13     3
 5    14     5
 6    15     5
 7    16     7
 8    17     7
 9    18     8
10    19     7
11    20     7
12    21     5
13    22     5
14    23     3
15    24     2
16    25     1
17    26     1

What I did was use the package dplyr to group and summarize your data:

  • %>% is the pipe operator. Basically, it takes the result on its left and passes it for processing to the command on its right. For example, when doing total_amostral %>% group_by(TOTAL), I’m taking the data frame total_amostral without any processing and grouping its rows according to the column TOTAL

  • group_by is a grouping function. It joins what is equal according to some criterion. In this case, I am joining the rows whose values of TOTAL are equal

  • finally, count() will simply count the occurrences of each element within the groups created above
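As a rough illustration of what the pipe does, the piped call can also be written as ordinary nested calls. This is only a sketch: it rebuilds total_amostral from the question so that it runs on its own, and the names piped and nested are made up for the example.

```r
library(dplyr)

# Rebuild the data from the question so the example is self-contained
y <- c(1, 2, 3, 4, 5, 6, 7, 8)
espaco_amostral <- data.frame(t(combn(y, m = 4)))
total_amostral <- data.frame(TOTAL = apply(espaco_amostral, 1, sum))

# Piped form: read left to right
piped <- total_amostral %>%
  group_by(TOTAL) %>%
  count()

# Equivalent nested form: read from the inside out
nested <- count(group_by(total_amostral, TOTAL))

# Both forms produce the same counts
identical(piped$n, nested$n)
```

The nested form gets harder to read as more steps are added, which is the main reason the pipe is popular.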

Since you are a beginner in R, I suggest you look for information on the package dplyr. It will help you immensely in pre-processing data for analysis or in computing basic descriptive statistics.

5

The function count() itself can accept the name of a column as an argument, and it then counts the occurrences of each unique value in that column. This way you can simplify the solution offered by @Marcusnunes.

library(dplyr)
total_amostral %>% 
  count(TOTAL)
# A tibble: 17 x 2
   TOTAL     n
   <dbl> <int>
 1    10     1
 2    11     1
 3    12     2
 4    13     3
 5    14     5
 6    15     5
 7    16     7
 8    17     7
 9    18     8
10    19     7
11    20     7
12    21     5
13    22     5
14    23     3
15    24     2
16    25     1
17    26     1

Adding the argument sort = TRUE to the call of count() orders the resulting data frame from the largest n to the smallest.
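A minimal sketch of the sorted count, assuming the same data as in the question (rebuilt here so the snippet runs on its own; the name ordenado is only illustrative):

```r
library(dplyr)

# Rebuild the data from the question
y <- c(1, 2, 3, 4, 5, 6, 7, 8)
espaco_amostral <- data.frame(t(combn(y, m = 4)))
total_amostral <- data.frame(TOTAL = apply(espaco_amostral, 1, sum))

# sort = TRUE puts the most frequent values of TOTAL first
ordenado <- total_amostral %>%
  count(TOTAL, sort = TRUE)

# The most frequent sum comes first: TOTAL = 18, which occurs 8 times
head(ordenado, 1)
```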

4

It is also possible to do this in one line with the package data.table:

library(data.table)
setDT(total_amostral) # convert total_amostral to a data.table

total_amostral[, .N, by = "TOTAL"]

Result:

> total_amostral[, .N, by = "TOTAL"]
    TOTAL N
 1:    10 1
 2:    11 1
 3:    12 2
 4:    13 3
 5:    14 5
 6:    15 5
 7:    16 7
 8:    17 7
 9:    18 8
10:    19 7
11:    20 7
12:    21 5
13:    22 5
14:    23 3
15:    24 2
16:    25 1
17:    26 1

If the data is large, data.table is probably the best option. Here microbenchmark::microbenchmark measures the time each approach takes to perform the operation:

y <- c(1:40) # sample deliberately enlarged
espaco_amostral<-data.frame(t(combn(y,m=4))) # Data frame with the combinations of vector y.
total_amostral<-data.frame(TOTAL=apply(espaco_amostral,1,sum)) # Data frame with the row sums of espaco_amostral.

unicos<-data.frame(unique(total_amostral)) # Data frame with the unique values of total_amostral

library(dplyr) # needed for count() in the benchmark below
library(data.table)
setDT(total_amostral)
total_amostral[, .N, by = "TOTAL"]

microbenchmark::microbenchmark(
  data.table = total_amostral[, .N, by = "TOTAL"], 
  dplyr = count(total_amostral, TOTAL), 
  base_1 = as.data.frame(table(total_amostral$TOTAL)),
  base_2 = aggregate(TOTAL ~ factor(TOTAL), total_amostral, length), 
  times = 100
)

Unit: milliseconds
       expr        min         lq       mean     median         uq        max neval cld
 data.table   1.996741   2.501213   3.205596   3.083716   3.771053   8.371286   100  a 
      dplyr   6.948481   8.809759  10.996959  10.733755  12.777168  29.598866   100  a 
     base_1   8.100727  10.199018  12.744832  12.457566  14.908276  21.049273   100  a 
     base_2 126.146868 157.310745 202.200196 199.385773 236.910523 403.407512   100   b

4

Here are two ways using only base R.

as.data.frame(table(total_amostral$TOTAL))
#   Var1 Freq
#1    10    1
#2    11    1
#3    12    2
#4    13    3
#5    14    5
#6    15    5
#7    16    7
#8    17    7
#9    18    8
#10   19    7
#11   20    7
#12   21    5
#13   22    5
#14   23    3
#15   24    2
#16   25    1
#17   26    1

aggregate(TOTAL ~ factor(TOTAL), total_amostral, length)
#   factor(TOTAL) TOTAL
#1             10     1
#2             11     1
#3             12     2
#4             13     3
#5             14     5
#6             15     5
#7             16     7
#8             17     7
#9             18     8
#10            19     7
#11            20     7
#12            21     5
#13            22     5
#14            23     3
#15            24     2
#16            25     1
#17            26     1

Then you can change the column names of these two results.

res1 <- as.data.frame(table(total_amostral$TOTAL))
names(res1)[1] <- "TOTAL"

res2 <- aggregate(TOTAL ~ factor(TOTAL), total_amostral, length)
names(res2) <- c("TOTAL", "Freq")
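To get back to what the question asked, the counts can also be stored as a new column of unicos. The following is only a sketch in base R: it rebuilds the data from the question, and the names contagem and n are arbitrary choices made for this example.

```r
# Rebuild the data from the question
y <- c(1, 2, 3, 4, 5, 6, 7, 8)
espaco_amostral <- data.frame(t(combn(y, m = 4)))
total_amostral <- data.frame(TOTAL = apply(espaco_amostral, 1, sum))
unicos <- data.frame(unique(total_amostral))

# table() counts each distinct value; its names are the values as strings
contagem <- table(total_amostral$TOTAL)

# Look up the count of each unique value by name and store it in a new column
unicos$n <- as.integer(contagem[as.character(unicos$TOTAL)])
unicos
```

Looking the counts up by name, rather than relying on position, keeps each count next to the right value even if unicos is not in sorted order.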
