Separate data by values on the line?

Question

Asked 3 years, 4 months ago

In the rows of my data frame I have an identifier made of letters for each set of observations. Some identifiers have more than one letter; in that case the observations are the joint action of the respective "single" letters.

Example:

a = one set of actions,
b = another set,
ab = values for a and b acting together (the association).

I would like to separate my data in R according to the associated letters, obtaining a new data set that contains only the data for an association and for its respective singles.

Example:

a, b, ab.

I have attached an image to illustrate this better, and I also have code for the example data.

Code for the data:

comprar = c(rep("a",times = 4), rep("b",times = 4), rep("c",times = 4), rep("ab",times = 4), rep("ac",times = 4), rep("bc",times = 4))

custo = c(12,14,16,18,22,24,26,28,17,19,21,23,34,36,38,42,44,46,48,52,62,64,66,68)

data = cbind(comprar, custo)

Thank you.

4 answers

You can use split:

data = data.frame(comprar=comprar, custo=custo)
Y = split(data, data$comprar)

For each distinct level of comprar you will get a data.frame inside a list:

$a
  comprar custo
1       a    12
2       a    14
3       a    16
4       a    18

$ab
   comprar custo
13      ab    34
14      ab    36
15      ab    38
16      ab    42

$ac
   comprar custo
17      ac    44
18      ac    46
19      ac    48
20      ac    52

If you want each element of that list to be available directly as its own data.frame object, you can do:

list2env(Y, envir = .GlobalEnv)
a
ab

The unique values of the variable comprar will then be the names of the data.frames in your session.
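As a minimal sketch of how the list returned by split can be used without copying every element into the global environment, the association "ab" and its singles can be picked by name and stacked. The object names ab_set and ab_df are only illustrative and are not part of the answer above.

```r
comprar <- c(rep("a", times = 4), rep("b", times = 4), rep("c", times = 4),
             rep("ab", times = 4), rep("ac", times = 4), rep("bc", times = 4))
custo <- c(12, 14, 16, 18, 22, 24, 26, 28, 17, 19, 21, 23,
           34, 36, 38, 42, 44, 46, 48, 52, 62, 64, 66, 68)

data <- data.frame(comprar = comprar, custo = custo)
Y <- split(data, data$comprar)

# The names of the list elements are the distinct values of comprar
names(Y)

# Pick the association "ab" together with its singles "a" and "b"
ab_set <- Y[c("a", "b", "ab")]

# Stack the three data.frames into a single one
ab_df <- do.call(rbind, ab_set)
ab_df
```

Indexing the list by name keeps the workspace clean and makes it harder to overwrite an existing object that happens to be called a or c.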

by Carlos Eduardo Lagosta • **5,497** points · Answer 1 · 2021-03-12T03:07:56+00:00

As @Guilherme-Parreira answered, split is the best way to separate the data by a variable:

dados <- data.frame(
  comprar = c(rep("a",times = 4), rep("b",times = 4), rep("c",times = 4), rep("ab",times = 4), rep("ac",times = 4), rep("bc",times = 4)),
  custo = c(12,14,16,18,22,24,26,28,17,19,21,23,34,36,38,42,44,46,48,52,62,64,66,68)
)

dados.lista <- split(dados, dados$comprar)

To select an association together with its single groups, you can write a small function based on strsplit:

selGrp <- function(grp) c(strsplit(grp, "")[[1]], grp)

selGrp("abc")
#> [1] "a"   "b"   "c"   "abc"

dados.lista[selGrp("ab")]
#> $a
#>   comprar custo
#> 1       a    12
#> 2       a    14
#> 3       a    16
#> 4       a    18
#>
#> $b
#>   comprar custo
#> 5       b    22
#> 6       b    24
#> 7       b    26
#> 8       b    28
#>
#> $ab
#>    comprar custo
#> 13      ab    34
#> 14      ab    36
#> 15      ab    38
#> 16      ab    42
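selGrp returns every letter plus the full identifier, whether or not those groups exist in the data. A hedged sketch of one way to guard against that: the helper selGrpPresent below is hypothetical (it is not part of the answer) and simply keeps the names that are actually present in the split list.

```r
dados <- data.frame(
  comprar = rep(c("a", "b", "c", "ab", "ac", "bc"), each = 4),
  custo = c(12, 14, 16, 18, 22, 24, 26, 28, 17, 19, 21, 23,
            34, 36, 38, 42, 44, 46, 48, 52, 62, 64, 66, 68)
)
dados.lista <- split(dados, dados$comprar)

selGrp <- function(grp) c(strsplit(grp, "")[[1]], grp)

# Hypothetical helper: keep only the group names that exist in the list
selGrpPresent <- function(grp, lista) intersect(selGrp(grp), names(lista))

# "ab" exists, so the singles and the association are all kept
names(dados.lista[selGrpPresent("ab", dados.lista)])

# "abc" is not in the data, so only the singles remain
names(dados.lista[selGrpPresent("abc", dados.lista)])
```

Without such a filter, indexing the list with a name that does not exist yields an empty element named NA, which can break later calls such as do.call(rbind, ...).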

To do this for all associations:

# Levels of the variable comprar:
lv.comprar <- levels(dados$comprar)
# or, if the variable comprar is not a factor:
lv.comprar <- as.character(unique(dados$comprar))

# Find the associations (i.e., the values with more than one character):
assocs <- lv.comprar[nchar(lv.comprar) > 1]

# Create a named list in which each element is a list with the data.frames selected for each association:
dados.spl <- setNames(lapply(assocs, function(x) dados.lista[selGrp(x)]), assocs)

str(dados.spl, max.level = 2)
#> List of 3
#>  $ ab:List of 3
#>   ..$ a :'data.frame':   4 obs. of  2 variables:
#>   ..$ b :'data.frame':   4 obs. of  2 variables:
#>   ..$ ab:'data.frame':   4 obs. of  2 variables:
#>  $ ac:List of 3
#>   ..$ a :'data.frame':   4 obs. of  2 variables:
#>   ..$ c :'data.frame':   4 obs. of  2 variables:
#>   ..$ ac:'data.frame':   4 obs. of  2 variables:
#>  $ bc:List of 3
#>   ..$ b :'data.frame':   4 obs. of  2 variables:
#>   ..$ c :'data.frame':   4 obs. of  2 variables:
#>   ..$ bc:'data.frame':   4 obs. of  2 variables:

dados.spl$ac
#> $a
#>   comprar custo
#> 1       a    12
#> 2       a    14
#> 3       a    16
#> 4       a    18
#>
#> $c
#>    comprar custo
#> 9        c    17
#> 10       c    19
#> 11       c    21
#> 12       c    23
#>
#> $ac
#>    comprar custo
#> 17      ac    44
#> 18      ac    46
#> 19      ac    48
#> 20      ac    52

Keeping the result as a list is the most versatile option. If you want to convert the elements to data.frames:

# Wide format
dados.spl.comp <- lapply(dados.spl, data.frame)

dados.spl.comp$ac
#>   a.comprar a.custo c.comprar c.custo ac.comprar ac.custo
#> 1         a      12         c      17         ac       44
#> 2         a      14         c      19         ac       46
#> 3         a      16         c      21         ac       48
#> 4         a      18         c      23         ac       52

# Long format
dados.spl.long <- lapply(dados.spl, function(x) do.call(rbind, x))

dados.spl.long$ac
#>       comprar custo
#> a.1         a    12
#> a.2         a    14
#> a.3         a    16
#> a.4         a    18
#> c.9         c    17
#> c.10        c    19
#> c.11        c    21
#> c.12        c    23
#> ac.17      ac    44
#> ac.18      ac    46
#> ac.19      ac    48
#> ac.20      ac    52

Or, each in a single line:

setNames(lapply(assocs, function(x) do.call(rbind, dados.lista[selGrp(x)])), assocs)

setNames(lapply(assocs, function(x) data.frame(dados.lista[selGrp(x)])), assocs)

by abreums • 54 points · Answer 2 · 2021-03-16T16:34:25+00:00

I understand that the goal is to obtain the final sets already separated, and that the process should be as automatic as possible.

If we had the lists of values of the "comprar" attribute for each subset, filtering would be enough to get the answer. This is what the function retrieve_sub_df below implements.

It seems to me that the main challenge is precisely to obtain the different sets <a, b, ab>, <a, c, ac>, etc., and to implement this in a way that also supports combinations other than those in the sample data.

However, everything in <a, b, ab> can already be derived from "ab". That is, once we have the list of combined identifiers, the individual elements can be recovered from them. This is what the function retrieve_tokens below does.

Using lists and the map function gives an elegant, automatic solution.

library(tidyverse)

df <- tibble(comprar = comprar,
             custo = custo)

# Returns the identifiers with the largest number of letters (the associations)
retrieve_tokens <- function() {
  tokens <- df %>%
    distinct(comprar) %>%
    mutate(len = str_length(comprar))
  max_tl <- max(tokens$len)
  tokens <- tokens %>%
    filter(len == max_tl)
  tokens$comprar
}

# Returns the rows of an association and of each of its single letters
retrieve_sub_df <- function(token) {
  criteria <- unlist(list(token, str_split(token, "")))
  df %>%
    filter(comprar %in% criteria)
}

tokens <- retrieve_tokens()
result <- map(tokens, ~retrieve_sub_df(.x))

result[[1]]
result[[2]]
result[[3]]
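The same idea (find the longest identifiers, then filter by each identifier and its letters) can be sketched in base R, assuming that avoiding the tidyverse dependency is acceptable. This is a swapped-in variant rather than the code of this answer; naming the result by token is an added convenience so that subsets can be retrieved as result$ac instead of result[[2]].

```r
comprar <- c(rep("a", times = 4), rep("b", times = 4), rep("c", times = 4),
             rep("ab", times = 4), rep("ac", times = 4), rep("bc", times = 4))
custo <- c(12, 14, 16, 18, 22, 24, 26, 28, 17, 19, 21, 23,
           34, 36, 38, 42, 44, 46, 48, 52, 62, 64, 66, 68)
df <- data.frame(comprar = comprar, custo = custo)

# Identifiers with the largest number of letters (here "ab", "ac", "bc")
tokens <- unique(df$comprar[nchar(df$comprar) == max(nchar(df$comprar))])

# One data.frame per token, holding the token and each of its letters
result <- lapply(setNames(tokens, tokens), function(token) {
  criteria <- c(token, strsplit(token, "")[[1]])
  df[df$comprar %in% criteria, ]
})

names(result)
result$ac
```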

by Paulo Marques • **3,739** points · Answer 3 · 2021-03-11T01:09:32+00:00

Updated on 03/11/2021

Because my answer received a negative score, I am updating this post with information that may not be known to everyone.

It is possible to run Python code from within an R script using the reticulate package.

Quoted directly from the package website:

The reticulate package provides a comprehensive set of tools for interoperability between Python and R. The package includes facilities for:

Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.

Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).

Flexible binding to different versions of Python including virtual environments and Conda environments.

In summary of the above, it is possible to call a Python script from within R, translate from R objects to Python objects (e.g., R and Pandas) and use Python virtual environments.

I believe the solution below meets the asker's request.

End of update

Solution based on Python

import pandas as pd

comprar = ["a"]*4 + ["b"]*4 + ["c"]*4 + ["ab"]*4 + ["ac"]*4 + ["bc"]*4
custo = [12,14,16,18,22,24,26,28,17,19,21,23,34,36,38,42,44,46,48,52,62,64,66,68]

df = pd.DataFrame({"comprar": comprar, "custo": custo})

# Group by the column name passed as a string, so each key is a plain value such as "a"
grupos = df.groupby("comprar")

for chave, grupo in grupos:
    print(f"Chave = {chave}")
    print(f"Grupo = \n{grupo}")
    print(80*"-")

The output will be:

Chave = a
Grupo =
   index comprar  custo
0      0       a     12
1      1       a     14
2      2       a     16
3      3       a     18
--------------------------------------------------------------------------------
Chave = ab
Grupo =
   index comprar  custo
0     12      ab     34
1     13      ab     36
2     14      ab     38
3     15      ab     42
--------------------------------------------------------------------------------
Chave = ac
Grupo =
   index comprar  custo
0     16      ac     44
1     17      ac     46
2     18      ac     48
3     19      ac     52
--------------------------------------------------------------------------------
Chave = b
Grupo =
   index comprar  custo
0      4       b     22
1      5       b     24
2      6       b     26
3      7       b     28
--------------------------------------------------------------------------------
Chave = bc
Grupo =
   index comprar  custo
0     20      bc     62
1     21      bc     64
2     22      bc     66
3     23      bc     68
--------------------------------------------------------------------------------

Updated on 03/11/2021

Turn the groups into a dictionary of data frames, where each key is a value of the grouping column and each value is the data frame for that group:

d = dict(tuple(grupos))

d["a"]
  comprar  custo
0       a     12
1       a     14
2       a     16
3       a     18


d["ac"]
   comprar  custo
16      ac     44
17      ac     46
18      ac     48
19      ac     52

End of update