Loop Operation with Multiple Data Frames

Consider the following example in R:

library(data.table)
library(dplyr)  # for %>%, select and filter used further down

Sample_data <- data.table(
  code = c("AAPL","AAPL","AAPL", "AMZN","AMZN","AMZN", "MSFT","MSFT", "GOOG","GOOG","GOOG", "FB"),
  date = c("2019-12-01","2020-01-01","2020-02-01", "2019-12-01","2020-01-01","2020-02-01", "2020-01-01","2020-02-01", "2019-12-01","2020-01-01","2020-02-01", "2019-12-01"),
  price = c(292.9,295.4,293.6, 1847.4,1849.3,1845.4, 157.2,159.1, 1337.1,1335.8,1333.7, 205.2)
)

I need to:

First: split the data into separate data frames by the "code" column. In this example that gives 5 data frames, one per code (3x2 for AAPL, 1x2 for FB, etc).

For that, I did the following:

plist <- unique(Sample_data$code)
lst <- setNames(vector("list", length(plist)), plist)  # named list, still empty
for (i in plist) {
  # creates one global variable per code (AAPL, AMZN, ...)
  assign(i, Sample_data %>% select(code, date, price) %>% filter(code %in% i))
}

Second: I need to perform a numerical operation within each data frame and add the result as a new column in that same data frame.

The problem is this second step. My idea is to build a list of the generated data frames and add the result of the operation to each one, but I do not know how.

2 answers

2

You do not need loops for this. The tidyverse can do it directly.

library(tidyverse)

First: split the data into separate data frames by the "code" column. In this example that gives 5 data frames, one per code (3x2 for AAPL, 1x2 for FB, etc).

You can do this with the function group_split:

lista <- Sample_data %>% 
  group_split(.tbl = ., code, .keep = TRUE)

Second: I need to perform a numerical operation within each data frame and add the result as a new column in that same data frame.

I'll create the new column with the functions mutate and across. The latter was recently added to the package dplyr to replace and unify the functions with the suffixes _all, _at and _if. See the dplyr documentation for across for details.

across may not be available in your installed version of dplyr (it was added in dplyr 1.0.0). If so, installing the development version is enough:
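To make the "replace and unify" point concrete, here is a minimal sketch comparing the older scoped verb mutate_at with across. The small tibble df and its volume column are made up for illustration and are not part of the question's data; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, only to compare the two styles
df <- tibble(
  code   = c("A", "A", "B"),
  price  = c(1, 2, 10),
  volume = c(5, 5, 7)
)

# Older style: a scoped verb picks the columns through vars()
old_style <- df %>%
  group_by(code) %>%
  mutate_at(vars(price, volume), list(total = sum)) %>%
  ungroup()

# across() style: the column selection moves inside mutate()
new_style <- df %>%
  group_by(code) %>%
  mutate(across(c(price, volume), list(total = sum))) %>%
  ungroup()

names(new_style)
```

Both calls should add the columns price_total and volume_total, each holding the per-code sum.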

remotes::install_github("tidyverse/dplyr")

Finally, the analysis. I will sum the variable price within each data frame using the function sum:

lista %>% 
  map(.x = ., .f = ~ mutate(.data = ., across(.cols = c('price'), .fns = list(~ sum(.)))))

Result:

[[1]]
# A tibble: 3 x 4
  code  date       price price_1
  <fct> <fct>      <dbl>   <dbl>
1 AAPL  2019-12-01  293.    882.
2 AAPL  2020-01-01  295.    882.
3 AAPL  2020-02-01  294.    882.

[[2]]
# A tibble: 3 x 4
  code  date       price price_1
  <fct> <fct>      <dbl>   <dbl>
1 AMZN  2019-12-01 1847.   5542.
2 AMZN  2020-01-01 1849.   5542.
3 AMZN  2020-02-01 1845.   5542.

[[3]]
# A tibble: 1 x 4
  code  date       price price_1
  <fct> <fct>      <dbl>   <dbl>
1 FB    2019-12-01  205.    205.

[[4]]
# A tibble: 3 x 4
  code  date       price price_1
  <fct> <fct>      <dbl>   <dbl>
1 GOOG  2019-12-01 1337.   4007.
2 GOOG  2020-01-01 1336.   4007.
3 GOOG  2020-02-01 1334.   4007.

[[5]]
# A tibble: 2 x 4
  code  date       price price_1
  <fct> <fct>      <dbl>   <dbl>
1 MSFT  2020-01-01  157.    316.
2 MSFT  2020-02-01  159.    316.
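Note that group_split returns an unnamed list, so the pieces are addressed by position. If you want to look up a data frame by its code, one possible alternative is base R's split followed by lapply. This is only a sketch: the data frame prices is a shortened stand-in for Sample_data, and by_code and sum_price are names chosen here for illustration.

```r
# Shortened stand-in for Sample_data (values taken from the question)
prices <- data.frame(
  code  = c("AAPL", "AAPL", "MSFT", "FB"),
  price = c(292.9, 295.4, 157.2, 205.2)
)

# split() returns a list whose elements are named after the grouping values
by_code <- split(prices, prices$code)

# Add the per-group sum as a new column in every data frame of the list
by_code <- lapply(by_code, function(d) {
  d$sum_price <- sum(d$price)
  d
})

names(by_code)   # the codes, usable as by_code$AAPL, by_code$FB, ...
by_code$AAPL
```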

0

Since you are using data.table, it is neither necessary nor recommended to split the data into separate data.frames; it is better to do the operations by group.

library(data.table)

# Summary of the data by group

> Sample_data[, .(sum_price = sum(price)), by = code]
   code sum_price
1: AAPL     881.9
2: AMZN    5542.1
3: MSFT     316.3
4: GOOG    4006.6
5:   FB     205.2

# Create a new column by group

Sample_data[, sum_price := sum(price), by = code]

> Sample_data
    code       date  price sum_price
 1: AAPL 2019-12-01  292.9     881.9
 2: AAPL 2020-01-01  295.4     881.9
 3: AAPL 2020-02-01  293.6     881.9
 4: AMZN 2019-12-01 1847.4    5542.1
 5: AMZN 2020-01-01 1849.3    5542.1
 6: AMZN 2020-02-01 1845.4    5542.1
 7: MSFT 2020-01-01  157.2     316.3
 8: MSFT 2020-02-01  159.1     316.3
 9: GOOG 2019-12-01 1337.1    4006.6
10: GOOG 2020-01-01 1335.8    4006.6
11: GOOG 2020-02-01 1333.7    4006.6
12:   FB 2019-12-01  205.2     205.2

If you really need to split the data into a list of data.tables, you can use data.table's split method (split with the by argument), but it is generally more efficient to do the operations by group.

> split(Sample_data, by = 'code')
$AAPL
   code       date price sum_price
1: AAPL 2019-12-01 292.9     881.9
2: AAPL 2020-01-01 295.4     881.9
3: AAPL 2020-02-01 293.6     881.9

$AMZN
   code       date  price sum_price
1: AMZN 2019-12-01 1847.4    5542.1
2: AMZN 2020-01-01 1849.3    5542.1
3: AMZN 2020-02-01 1845.4    5542.1

$MSFT
   code       date price sum_price
1: MSFT 2020-01-01 157.2     316.3
2: MSFT 2020-02-01 159.1     316.3

$GOOG
   code       date  price sum_price
1: GOOG 2019-12-01 1337.1    4006.6
2: GOOG 2020-01-01 1335.8    4006.6
3: GOOG 2020-02-01 1333.7    4006.6

$FB
   code       date price sum_price
1:   FB 2019-12-01 205.2     205.2
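If you do go the list route, the per-table step and the recombination could look roughly like the sketch below. The table dt is a shortened stand-in for Sample_data, and the column share (each price as a fraction of its group's total) is a made-up operation for illustration. The last lines compute the same column by group, without any split.

```r
library(data.table)

# Shortened stand-in for Sample_data (values taken from the question)
dt <- data.table(
  code  = c("AAPL", "AAPL", "MSFT", "FB"),
  price = c(292.9, 295.4, 157.2, 205.2)
)

# Named list of data.tables, one per code
parts <- split(dt, by = "code")

# Apply an operation to each piece; copy() keeps := from touching the input
parts <- lapply(parts, function(d) {
  copy(d)[, share := price / sum(price)]
})

# Bind the pieces back into a single table
combined <- rbindlist(parts)

# Same column computed by group, without splitting
grouped <- copy(dt)[, share := price / sum(price), by = code]
```

Comparing combined and grouped (for example with all.equal) is a quick way to check that the split was not needed for the calculation itself.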

If you prefer tidyverse syntax and functions, you are better off using a tibble as the table format; using data.table in that case brings no benefit. In both cases, keeping the data in a single table and doing the operations by group is preferable to splitting the data.

library(dplyr)

Sample_data <- tibble(code = c("AAPL","AAPL","AAPL", "AMZN","AMZN","AMZN", "MSFT","MSFT", "GOOG","GOOG","GOOG", "FB"), date = c("2019-12-01","2020-01-01","2020-02-01", "2019-12-01","2020-01-01","2020-02-01", "2020-01-01","2020-02-01", "2019-12-01","2020-01-01","2020-02-01", "2019-12-01"), price = c(292.9,295.4,293.6, 1847.4,1849.3,1845.4, 157.2,159.1, 1337.1,1335.8,1333.7, 205.2 ) )

> Sample_data %>% group_by(code) %>% summarise(sum_price = sum(price))
# A tibble: 5 x 2
  code  sum_price
  <chr>     <dbl>
1 AAPL       882.
2 AMZN      5542.
3 FB         205.
4 GOOG      4007.
5 MSFT       316.

Sample_data <- Sample_data %>% group_by(code) %>% mutate(sum_price = sum(price))

> Sample_data
# A tibble: 12 x 4
# Groups:   code [5]
   code  date       price sum_price
   <chr> <chr>      <dbl>     <dbl>
 1 AAPL  2019-12-01  293.      882.
 2 AAPL  2020-01-01  295.      882.
 3 AAPL  2020-02-01  294.      882.
 4 AMZN  2019-12-01 1847.     5542.
 5 AMZN  2020-01-01 1849.     5542.
 6 AMZN  2020-02-01 1845.     5542.
 7 MSFT  2020-01-01  157.      316.
 8 MSFT  2020-02-01  159.      316.
 9 GOOG  2019-12-01 1337.     4007.
10 GOOG  2020-01-01 1336.     4007.
11 GOOG  2020-02-01 1334.     4007.
12 FB    2019-12-01  205.      205.

> Sample_data %>% group_by(code) %>% group_split()
[[1]]
# A tibble: 3 x 4
  code  date       price sum_price
  <chr> <chr>      <dbl>     <dbl>
1 AAPL  2019-12-01  293.      882.
2 AAPL  2020-01-01  295.      882.
3 AAPL  2020-02-01  294.      882.

[[2]]
# A tibble: 3 x 4
  code  date       price sum_price
  <chr> <chr>      <dbl>     <dbl>
1 AMZN  2019-12-01 1847.     5542.
2 AMZN  2020-01-01 1849.     5542.
3 AMZN  2020-02-01 1845.     5542.

[[3]]
# A tibble: 1 x 4
  code  date       price sum_price
  <chr> <chr>      <dbl>     <dbl>
1 FB    2019-12-01  205.      205.

[[4]]
# A tibble: 3 x 4
  code  date       price sum_price
  <chr> <chr>      <dbl>     <dbl>
1 GOOG  2019-12-01 1337.     4007.
2 GOOG  2020-01-01 1336.     4007.
3 GOOG  2020-02-01 1334.     4007.

[[5]]
# A tibble: 2 x 4
  code  date       price sum_price
  <chr> <chr>      <dbl>     <dbl>
1 MSFT  2020-01-01  157.      316.
2 MSFT  2020-02-01  159.      316.

attr(,"ptype")
# A tibble: 0 x 4
# … with 4 variables: code <chr>, date <chr>, price <dbl>, sum_price <dbl>
