How to consolidate (aggregate or group) the values in a database?

Asked

Viewed 12,797 times

22

Suppose I have the following database

vendas<-c(100,140,200,300,20,1000,200,3000)
vendedor<-c("A","B","A","B","C","C","D","A")
regiao<-c("Norte","Sul","Leste","Norte","Sul","Norte","Leste","Sul")
df<-data.frame(vendedor,regiao,vendas)

And please review the total sales per seller and per seller/region.

How do I generate this new database with aggregated data for analysis?

5 answers

20


Hadley recently created the dplyr, a much faster version with more intuitive syntax than the plyr. (links for the cran and to the announcement on Rstudio’s blog)

In the dplyr would look like this

library(dplyr)
group_by(df,vendedor)%>%summarise(Total=sum(vendas))
  vendedor Total
1        A  3300
2        B   440
3        C  1020
4        D   200

And grouping by seller and region

group_by(df,vendedor, regiao)%>%summarise(Total=sum(vendas))
   vendedor regiao Total
1        A  Leste   200
2        A  Norte   100
3        A    Sul  3000
4        B  Norte   300
5        B    Sul   140
6        C  Norte  1000
7        C    Sul    20
8        D  Leste   200

Editing: the latest version of dplyr uses the operator %>% of magrittr.

12

Another option is to use the package data.table:

df <- data.table(df)
df[,sum(vendas), by=vendedor]
df[,sum(vendas), by=data.table(vendedor, regiao)]

Packages like the plyr and the data.table were created to facilitate and expedite this (and other) types of analysis, so if your bank is large or has to do many groupings, it may be worth studying them!

11

You can do this with the function aggregate

aggregate(vendas ~ vendedor,
          data=df,
          FUN=sum)

  vendedor vendas
1        A   3300
2        B    440
3        C   1020
4        D    200

aggregate(vendas ~ vendedor + regiao,
          data=df,
          FUN=sum)

  vendedor regiao vendas
1        A  Leste    200
2        D  Leste    200
3        A  Norte    100
4        B  Norte    300
5        C  Norte   1000
6        A    Sul   3000
7        B    Sul    140
8        C    Sul     20

Another option is to use the package plyr

library(plyr)

ddply(df,
      c("vendedor"),
      summarise,
      total = sum(vendas))

  vendedor total
1        A  3300
2        B   440
3        C  1020
4        D   200

ddply(df,
      c("vendedor", "regiao"),
      summarise,
      total = sum(vendas))

  vendedor regiao total
1        A  Leste   200
2        A  Norte   100
3        A    Sul  3000
4        B  Norte   300
5        B    Sul   140
6        C  Norte  1000
7        C    Sul    20
8        D  Leste   200

6

You can also use tapply:

Per salesman:

with(df, tapply(vendas, vendedor, sum))
 A    B    C    D 
3300  440 1020  200 

By seller and region:

with(df, tapply(vendas, list(vendedor, regiao), sum))
  Leste Norte  Sul
A   200   100 3000
B    NA   300  140
C    NA  1000   20
D   200    NA   NA

5

Another option is to use sqldf

library(sqldf)
sqldf("SELECT vendedor, sum(vendas) from df group by vendedor")
sqldf("SELECT vendedor, regiao, sum(vendas) from df group by vendedor, regiao")

Browser other questions tagged

You are not signed in. Login or sign up in order to post.