Obtaining single records based on two columns

Question

Obtaining single records based on two columns

Asked 3 years, 12 months ago

Viewed 64 times

5

Considering the dataframe df1 intended to obtain a dataframe with unique records (origin, destination) counting the number of unique pairs (origin, destination) and summing the value per unique pair:

df1 <- data.frame(
    origem = c("A","A", "A", "B", "B", "B", "B"), 
    destino=c("A","B","B","A", "A", "C","B"),
    valor=(c(1, 2,NA, 3, 0, 5,-1)))

Intended result:

  origem destino num_total valor_total
1      A       A         1           1
2      A       B         2           2
3      B       A         2           3
4      B       C         1           5
5      B       B         1          -1

3 answers

4

With R base, you can use aggregate. For more than one function, you can join the result of different aggregations:

agg <- merge(
  aggregate(df1[3], by = df1[1:2], length),
  aggregate(df1[3], by = df1[1:2], sum, na.rm = TRUE),
  by = c("origem", "destino"))

names(agg)[3:4] <- c("num_total", "valor_total")

agg
#>   origem destino num_total valor_total
#> 1      A       A         1           1
#> 2      A       B         2           2
#> 3      B       A         2           3
#> 4      B       B         1          -1
#> 5      B       C         1           5

Or concatenate several functions with c:

agg <- aggregate(df1$valor, by = df1[1:2], function(x) c(num_total = length(x), valor_total = sum(x, na.rm = TRUE)))

As pointed out by @Rui-Barradas in the comments, concatenation will result in a variable x containing a matrix with the results. To have a data frame. only with vector columns:

agg <- cbind(agg[-length(agg)], agg[[length(agg)]])

agg
#>   origem destino num_total valor_total
#> 1      A       A         1           1
#> 2      B       A         2           3
#> 3      A       B         2           2
#> 4      B       B         1          -1
#> 5      B       C         1           5

Packages like dplyr (used in the reply by Marcus Nunes) facilitate operations per group. Another option is data table.:

library(data.table)

setDT(df1)

df1[, .(num_total = .N, valor_total = sum(valor, na.rm = TRUE)), .(origem, destino)]
#>    origem destino num_total valor_total
#> 1:      A       A         1           1
#> 2:      A       B         2           2
#> 3:      B       A         2           3
#> 4:      B       C         1           5
#> 5:      B       B         1          -1

1

I voted but one note: if you do str(agg) will see that the variable x is a matrix, the result of rbind of the vectors produced by the function c(). An option to have a single df with column vectors is agg<-aggregate(.) followed by cbind(agg[-length(agg)], agg[[length(agg)]]). Here is the method cbind.data.frame which is called, since agg[-length(agg)] is a df.

– Rui Barradas

2021/08/10 at 05:39
Thanks, I had not noticed this detail. I will edit the ASAP response.

– Carlos Eduardo Lagosta

2021/08/10 at 22:52

Browser other questions tagged r

You are not signed in. Login or sign up in order to post.

by Marcus Nunes • **17,915** points · Answer 1 · 2021-08-09T19:43:59+00:00

I would solve this problem as follows:

library(tidyverse)

df1 %>%
  na.omit() %>%
  group_by(origem, destino) %>%
  summarise(num_total = n(),
            valor_total = sum(valor)) %>%
  ungroup()
#> `summarise()` has grouped output by 'origem'. You can override using the `.groups` argument.
#> # A tibble: 5 × 4
#>   origem destino num_total valor_total
#>   <chr>  <chr>       <int>       <dbl>
#> 1 A      A               1           1
#> 2 A      B               1           2
#> 3 B      A               2           3
#> 4 B      B               1          -1
#> 5 B      C               1           5

^{Created on 2021-08-09 by the reprex package (v2.0.1)}

Explaining each line of code:

na.omit is the function that removes lines with values NA, because they are not necessary for the final sum of occurrences or values
group_by is used to group the data by source and destination, so that the operations from here are applied in the groups created
summarise serves to summarize a data set, applying to it a transformation that will result in an equal or lesser number of lines
- n() account of the occurrences of each group (in this case source and destination pairs)
- sum sum the values of each pair
ungroup removes the grouping of data, maintaining a Tibble simple answer

by Macosso • 1 point · Answer 2 · 2021-08-15T14:23:11+00:00

You can use group_by and summarise from dplyr, with the commands below


library(dplyr)

df1 <- data.frame(
  origem = c("A","A", "A", "B", "B", "B", "B"), 
  destino=c("A","B","B","A", "A", "C","B"),
  valor=(c(1, 2,NA, 3, 0, 5,-1)))

use group_by and summarise, and you need to use na.rm = T to ignore NA when adding

df2 <- df1 %>%
  group_by(origem, destino) %>%
  summarise(num_total = n(),
            valor_total = sum(valor, na.rm = T))

print(df2) # fazer o print do novo dataframe
# A tibble: 5 x 4
# Groups:   origem [2]
  origem destino num_total valor_total
  <chr>  <chr>       <int>       <dbl>
1 A      A               1           1
2 A      B               2           2
3 B      A               2           3
4 B      B               1          -1
5 B      C               1           5