How to summarize data in R?

Asked 6 years, 6 months ago

I have a sample of purchase data and would like to know how many purchases each user made in total.

Output of dput() to help with answers:

structure(list(USUARIO = c(931053L, 276977L, 354508L, 909717L, 
69758L, 104827L, 6600051L, 5035952L, 335505L, 340387L, 103130L, 
317058L, 424447L, 6862455L, 5040771L, 2360439L, 346941L, 426400L, 
271410L, 809550L, 96394L, 161292L, 752270L, 3703472L, 260921L, 
20557L, 291092L, 806951L, 82997L, 984555L, 5080457L, 31454L, 
5123415L, 498622L, 786436L, 320239L, 29603L, 6583452L, 304246L, 
6734562L, 101254L, 516730L, 37847L, 6928520L, 7705558L, 299285L, 
7760544L, 7760206L, 377014L, 104312L, 433721L, 87913L, 6732808L, 
633687L, 7526265L, 5038688L, 7500519L, 6640730L, 420430L, 47049L, 
7699248L, 6898123L, 7698394L, 7723798L, 577026L, 296424L, 165665L, 
152160L, 797450L, 90960L, 352622L, 6827072L, 7812492L, 532571L, 
6795263L, 7611543L, 429681L, 21840L, 6683144L, 18176L, 389995L, 
748456L, 423368L, 325129L, 7541131L, 186283L, 7795747L, 6760326L, 
6849786L, 202426L, 56131L, 676905L, 7550723L, 258189L, 123517L, 
368966L, 373162L, 183484L, 7583616L, 7716239L), DATA = structure(c(17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 
17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731, 17731
), class = "Date"), VL_PED_PG = c(20, 20, 50, 20.32, 20, 30, 
50, 50, 50, 50, 20, 20, 30, 30, 30, 30, 30, 50, 30, 50, 30, 20, 
30, 20, 30, 50, 30, 50, 50, 46, 50, 30, 50, 15, 20, 30, 50, 20, 
30, 30, 50, 50, 20, 50, 46, 50, 48, 49, 40, 20, 50, 50, 20, 30, 
40, 49, 8, 16, 49, 20, 40, 16, 16, 46, 20, 10, 50, 20, 50, 12, 
30, 48, 6, 50, 30, 49, 10, 20, 20, 30, 12, 50, 30, 30, 26, 50, 
26, 50, 50, 10, 10, 30, 46, 20, 15, 50, 20, 20, 26, 16)), row.names = c(NA, 
100L), class = "data.frame")

I'm trying to use summarise() from the dplyr package, but I'm struggling to work out how to apply it. What should I do?

Do you want the number of purchases per user, treating each user as a group because they make several purchases? If so, the data set provided does not help, because every user in it is unique. Also, which variable represents the "purchases"?

– neves

2019/01/22 at 19:33

Hello, the thing is that the database is very large. In the full data, users do repeat; this sample just has only 100 rows.

– Izak Mandrak

2019/01/22 at 19:50

I understand. Take a look at @Tomás' reply, as it answers your question.

– neves

2019/01/22 at 19:51
I will test it with the full data set. Thank you very much.

– Izak Mandrak

2019/01/22 at 19:51

1 answer

by Tomás Barcellos • **5,562** points · Answer 1 · 2019-01-22T19:50:04+00:00

dplyr works like this: you group the data by your unit of analysis, in this case USUARIO, and then create a summary of the data for each unit.

In this case we have

library(dplyr)
dados %>% 
  group_by(USUARIO) %>% 
  summarise(quantidade = n())

# A tibble: 100 x 2
   USUARIO quantidade
     <int>      <int>
 1   18176          1
 2   20557          1
 3   21840          1
 4   29603          1
 5   31454          1
 6   37847          1
 7   47049          1
 8   56131          1
 9   69758          1
10   82997          1
# ... with 90 more rows
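
Because every user in the sample appears only once, all the counts above are 1. As a minimal sketch of what the same pattern gives when users repeat, here is a made-up data set (`compras` and its values are hypothetical, not the asker's data). It also adds a second summary column, assuming VL_PED_PG is the amount paid per purchase:

```r
library(dplyr)

# Hypothetical toy data: three users, some with more than one purchase
compras <- data.frame(
  USUARIO   = c(101L, 102L, 101L, 103L, 101L, 102L),
  VL_PED_PG = c(20, 50, 30, 10, 46, 20)
)

resumo <- compras %>%
  group_by(USUARIO) %>%
  summarise(
    quantidade  = n(),            # number of rows (purchases) per user
    valor_total = sum(VL_PED_PG)  # sum of VL_PED_PG per user (assumed to be the amount paid)
  )

resumo
```

Here user 101 ends up with quantidade = 3, user 102 with 2, and user 103 with 1.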

Great material on dplyr, in English, can be found here. This cheatsheet may also be useful.

dplyr also offers a utility function that counts the occurrences of each unique value of a variable in a data.frame: count(). Its syntax is to pass the data.frame as the first argument and then list the variables that make up the unit of analysis to count by. Using it we would have:

dados %>% count(USUARIO)

# A tibble: 100 x 2
   USUARIO     n
     <int> <int>
 1   18176     1
 2   20557     1
 3   21840     1
 4   29603     1
 5   31454     1
 6   37847     1
 7   47049     1
 8   56131     1
 9   69758     1
10   82997     1
# ... with 90 more rows

The count() function has a sort argument, which must be passed explicitly by name (it defaults to FALSE). With sort = TRUE, the result is ordered from the largest n to the smallest.
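
A small sketch of sort = TRUE, again on made-up data with repeated users (`compras` is a hypothetical name and the values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: user 103 has the most purchases
compras <- data.frame(
  USUARIO = c(101L, 103L, 102L, 103L, 103L, 102L)
)

# Default: rows come out ordered by USUARIO
compras %>% count(USUARIO)

# sort = TRUE: rows come out ordered by n, largest first
contagem <- compras %>% count(USUARIO, sort = TRUE)
contagem
```

With sort = TRUE, user 103 (3 purchases) comes first, followed by 102 (2) and 101 (1).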