How to summarize data in R?

I have a sample of purchase data and would like to know how many purchases each user made in total.

dput output to help with answering:

structure(list(USUARIO = c(931053L, 276977L, 354508L, 909717L, 
69758L, 104827L, 6600051L, 5035952L, 335505L, 340387L, 103130L, 
317058L, 424447L, 6862455L, 5040771L, 2360439L, 346941L, 426400L, 
271410L, 809550L, 96394L, 161292L, 752270L, 3703472L, 260921L, 
20557L, 291092L, 806951L, 82997L, 984555L, 5080457L, 31454L, 
5123415L, 498622L, 786436L, 320239L, 29603L, 6583452L, 304246L, 
6734562L, 101254L, 516730L, 37847L, 6928520L, 7705558L, 299285L, 
7760544L, 7760206L, 377014L, 104312L, 433721L, 87913L, 6732808L, 
633687L, 7526265L, 5038688L, 7500519L, 6640730L, 420430L, 47049L, 
7699248L, 6898123L, 7698394L, 7723798L, 577026L, 296424L, 165665L, 
152160L, 797450L, 90960L, 352622L, 6827072L, 7812492L, 532571L, 
6795263L, 7611543L, 429681L, 21840L, 6683144L, 18176L, 389995L, 
748456L, 423368L, 325129L, 7541131L, 186283L, 7795747L, 6760326L, 
6849786L, 202426L, 56131L, 676905L, 7550723L, 258189L, 123517L, 
368966L, 373162L, 183484L, 7583616L, 7716239L), DATA = structure(c(17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731
), class = "Date"), VL_PED_PG = c(20, 20, 50, 20.32, 20, 30, 
50, 50, 50, 50, 20, 20, 30, 30, 30, 30, 30, 50, 30, 50, 30, 20, 
30, 20, 30, 50, 30, 50, 50, 46, 50, 30, 50, 15, 20, 30, 50, 20, 
30, 30, 50, 50, 20, 50, 46, 50, 48, 49, 40, 20, 50, 50, 20, 30, 
40, 49, 8, 16, 49, 20, 40, 16, 16, 46, 20, 10, 50, 20, 50, 12, 
30, 48, 6, 50, 30, 49, 10, 20, 20, 30, 12, 50, 30, 30, 26, 50, 
26, 50, 50, 10, 10, 30, 46, 20, 15, 50, 20, 20, 26, 16)), row.names = c(NA, 
100L), class = "data.frame")

I'm trying to use summarise from the dplyr package, but I'm struggling with how to use it. What should I do?

  • 1

    Do you want to know the number of purchases per user, where each user counts as a group because they make several purchases? If so, the data set provided does not illustrate that, because each user appears only once in it. Also, which variable represents "purchases"?

  • 1

    Hello, the thing is that the database is very large. In the full data users do repeat; this sample just has only 100 rows.

  • 1

    I understand. Take a look at @Tomás' reply, as it answers what you ask.

  • I will test with the full dataset. Thank you very much.

1 answer

4


summarise works like this: you group the data by your unit of analysis, in this case USUARIO, and then create a summary of the data for each unit.

In this case we have:

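library(dplyr)

# dados is the data.frame from the question (the dput output assigned to dados)
dados %>% 
  group_by(USUARIO) %>%          # one group per user
  summarise(quantidade = n())    # n() counts the rows in each group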

# A tibble: 100 x 2
   USUARIO quantidade
     <int>      <int>
 1   18176          1
 2   20557          1
 3   21840          1
 4   29603          1
 5   31454          1
 6   37847          1
 7   47049          1
 8   56131          1
 9   69758          1
10   82997          1
# ... with 90 more rows
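
Because every user appears only once in the 100-row sample, each count above is 1. A minimal sketch of the same group-and-summarise pattern on data where users repeat is given below. The toy data frame `compras` and its values are made up for illustration (only the column names come from the question), and `total_pago` is a hypothetical extra summary column added to show that several summaries can be computed at once.

```r
library(dplyr)

# Hypothetical toy data: values are invented for illustration;
# only the column names USUARIO and VL_PED_PG come from the question.
compras <- data.frame(
  USUARIO   = c(101L, 102L, 101L, 103L, 102L, 101L),
  VL_PED_PG = c(20, 50, 30, 20, 46, 10)
)

resumo <- compras %>%
  group_by(USUARIO) %>%            # one group per user
  summarise(
    quantidade = n(),              # number of purchases (rows) per user
    total_pago = sum(VL_PED_PG)    # hypothetical extra summary: total amount paid
  )

print(resumo)
```

In this toy example user 101 has three rows, so its quantidade is 3 and its total_pago is 60; user 103 appears once, so its quantidade is 1.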

Great material on dplyr, in English, can be found here. This cheatsheet may also be useful.

dplyr also offers a utility function, count(), that counts the occurrences of each unique value of a variable in a data.frame. Its syntax is to pass the data.frame as the first argument and then list the variables that make up the unit of analysis for the count. Using it we would have:

dados %>% count(USUARIO)

# A tibble: 100 x 2
   USUARIO     n
     <int> <int>
 1   18176     1
 2   20557     1
 3   21840     1
 4   29603     1
 5   31454     1
 6   37847     1
 7   47049     1
 8   56131     1
 9   69758     1
10   82997     1
# ... with 90 more rows

The function count() has a sort argument that defaults to FALSE; when you pass sort = TRUE explicitly, the result is ordered from the largest n to the smallest.
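
As a rough sketch of the sort argument in use, the example below runs count() with sort = TRUE. The toy data frame `compras` is invented for illustration and is not the question's data; the users are chosen so that the sorted order differs from the default order by USUARIO.

```r
library(dplyr)

# Hypothetical toy data: user 103 buys three times, 102 twice, 101 once.
compras <- data.frame(USUARIO = c(103L, 101L, 103L, 102L, 103L, 102L))

# Default: rows come back ordered by USUARIO.
padrao <- compras %>% count(USUARIO)

# sort = TRUE: rows come back ordered by n, largest first.
ordenado <- compras %>% count(USUARIO, sort = TRUE)

print(padrao)
print(ordenado)
```

With this toy data the default result lists users 101, 102, 103, while the sorted result lists 103, 102, 101, since user 103 has the most rows.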
