Obtaining single records based on two columns

Asked

Viewed 64 times

5

Considering the dataframe df1 intended to obtain a dataframe with unique records (origin, destination) counting the number of unique pairs (origin, destination) and summing the value per unique pair:

df1 <- data.frame(
    origem = c("A","A", "A", "B", "B", "B", "B"), 
    destino=c("A","B","B","A", "A", "C","B"),
    valor=(c(1, 2,NA, 3, 0, 5,-1)))

Intended result:

  origem destino num_total valor_total
1      A       A         1           1
2      A       B         2           2
3      B       A         2           3
4      B       C         1           5
5      B       B         1          -1

3 answers

4

I would solve this problem as follows:

library(tidyverse)

df1 %>%
  na.omit() %>%
  group_by(origem, destino) %>%
  summarise(num_total = n(),
            valor_total = sum(valor)) %>%
  ungroup()
#> `summarise()` has grouped output by 'origem'. You can override using the `.groups` argument.
#> # A tibble: 5 × 4
#>   origem destino num_total valor_total
#>   <chr>  <chr>       <int>       <dbl>
#> 1 A      A               1           1
#> 2 A      B               1           2
#> 3 B      A               2           3
#> 4 B      B               1          -1
#> 5 B      C               1           5

Created on 2021-08-09 by the reprex package (v2.0.1)

Explaining each line of code:

  • na.omit is the function that removes lines with values NA, because they are not necessary for the final sum of occurrences or values
  • group_by is used to group the data by source and destination, so that the operations from here are applied in the groups created
  • summarise serves to summarize a data set, applying to it a transformation that will result in an equal or lesser number of lines
    • n() account of the occurrences of each group (in this case source and destination pairs)
    • sum sum the values of each pair
  • ungroup removes the grouping of data, maintaining a Tibble simple answer

4


With R base, you can use aggregate. For more than one function, you can join the result of different aggregations:

agg <- merge(
  aggregate(df1[3], by = df1[1:2], length),
  aggregate(df1[3], by = df1[1:2], sum, na.rm = TRUE),
  by = c("origem", "destino"))

names(agg)[3:4] <- c("num_total", "valor_total")

agg
#>   origem destino num_total valor_total
#> 1      A       A         1           1
#> 2      A       B         2           2
#> 3      B       A         2           3
#> 4      B       B         1          -1
#> 5      B       C         1           5

Or concatenate several functions with c:

agg <- aggregate(df1$valor, by = df1[1:2], function(x) c(num_total = length(x), valor_total = sum(x, na.rm = TRUE)))

As pointed out by @Rui-Barradas in the comments, concatenation will result in a variable x containing a matrix with the results. To have a data frame. only with vector columns:

agg <- cbind(agg[-length(agg)], agg[[length(agg)]])

agg
#>   origem destino num_total valor_total
#> 1      A       A         1           1
#> 2      B       A         2           3
#> 3      A       B         2           2
#> 4      B       B         1          -1
#> 5      B       C         1           5

Packages like dplyr (used in the reply by Marcus Nunes) facilitate operations per group. Another option is data table.:

library(data.table)

setDT(df1)

df1[, .(num_total = .N, valor_total = sum(valor, na.rm = TRUE)), .(origem, destino)]
#>    origem destino num_total valor_total
#> 1:      A       A         1           1
#> 2:      A       B         2           2
#> 3:      B       A         2           3
#> 4:      B       C         1           5
#> 5:      B       B         1          -1
  • 1

    I voted but one note: if you do str(agg) will see that the variable x is a matrix, the result of rbind of the vectors produced by the function c(). An option to have a single df with column vectors is agg<-aggregate(.) followed by cbind(agg[-length(agg)], agg[[length(agg)]]). Here is the method cbind.data.frame which is called, since agg[-length(agg)] is a df.

  • Thanks, I had not noticed this detail. I will edit the ASAP response.

0

You can use group_by and summarise from dplyr, with the commands below


library(dplyr)

df1 <- data.frame(
  origem = c("A","A", "A", "B", "B", "B", "B"), 
  destino=c("A","B","B","A", "A", "C","B"),
  valor=(c(1, 2,NA, 3, 0, 5,-1)))

use group_by and summarise, and you need to use na.rm = T to ignore NA when adding

df2 <- df1 %>%
  group_by(origem, destino) %>%
  summarise(num_total = n(),
            valor_total = sum(valor, na.rm = T))

print(df2) # fazer o print do novo dataframe
# A tibble: 5 x 4
# Groups:   origem [2]
  origem destino num_total valor_total
  <chr>  <chr>       <int>       <dbl>
1 A      A               1           1
2 A      B               2           2
3 B      A               2           3
4 B      B               1          -1
5 B      C               1           5

Browser other questions tagged

You are not signed in. Login or sign up in order to post.