How to create a categorical variable in R?

Asked

Viewed 385 times

-2

I have a base with columns: USUARIO, MÊS, ANO and compra_mês. I would like to create variables dummy based on compras_mês for USUARIO, every month of the year.

   USUARIO MÊS    ANO      compras_mês
     <int> <ord> <dbl>       <int>
 1      37 jan    2019           1
 2      37 set    2018           1
 3      37 out    2018           4
 4      37 nov    2018           3
 5      37 dez    2018           3
 6      49 out    2018           1
 7      49 nov    2018           1
 8      49 dez    2018           1
 9     107 jan    2019           2
10     107 set    2018           1

As an example, my goal would be to leave the database like this:

USUARIO  set  out  nov  dez

 37      1    1    1    1
 49      1    0    1    0
 107     0    0    1    1

How can I do that? The command dummy would be a solution?

Follow the dput to help with the question:

> dput(head(teste, 50))
structure(list(USUARIO = c(37L, 37L, 37L, 37L, 37L, 49L, 49L, 
49L, 107L, 107L, 107L, 107L, 107L, 934L, 934L, 934L, 934L, 934L, 
1116L, 1116L, 1116L, 1116L, 1160L, 1160L, 1160L, 1160L, 1160L, 
1160L, 1302L, 1302L, 1302L, 1302L, 1337L, 1337L, 1337L, 1384L, 
1384L, 1384L, 1384L, 1384L, 1384L, 1532L, 1532L, 1532L, 1532L, 
1532L, 1532L, 1551L, 1551L, 1551L), MÊS = structure(c(1L, 9L, 
10L, 11L, 12L, 10L, 11L, 12L, 1L, 9L, 10L, 11L, 12L, 1L, 8L, 
10L, 11L, 12L, 1L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 
9L, 10L, 11L, 12L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 8L, 9L, 10L), .Label = c("jan", "fev", 
"mar", "abr", "mai", "jun", "jul", "ago", "set", "out", "nov", 
"dez"), class = c("ordered", "factor")), ANO = c(2019, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2018, 
2019, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2019, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2019, 2018, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018), compras_mês = c(1L, 1L, 4L, 3L, 3L, 
1L, 1L, 1L, 2L, 1L, 5L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 2L, 4L, 
4L, 5L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 2L)), row.names = c(NA, 
-50L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("USUARIO", 
"MÊS"), drop = TRUE, indices = list(0L, 1L, 2L, 3L, 4L, 5L, 6L, 
    7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
    19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 
    31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 
    43L, 44L, 45L, 46L, 47L, 48L, 49L), group_sizes = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), biggest_group_size = 1L, labels = structure(list(USUARIO = c(37L, 
37L, 37L, 37L, 37L, 49L, 49L, 49L, 107L, 107L, 107L, 107L, 107L, 
934L, 934L, 934L, 934L, 934L, 1116L, 1116L, 1116L, 1116L, 1160L, 
1160L, 1160L, 1160L, 1160L, 1160L, 1302L, 1302L, 1302L, 1302L, 
1337L, 1337L, 1337L, 1384L, 1384L, 1384L, 1384L, 1384L, 1384L, 
1532L, 1532L, 1532L, 1532L, 1532L, 1532L, 1551L, 1551L, 1551L
), MÊS = structure(c(1L, 9L, 10L, 11L, 12L, 10L, 11L, 12L, 1L, 
9L, 10L, 11L, 12L, 1L, 8L, 10L, 11L, 12L, 1L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 9L, 10L, 11L, 12L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 8L, 9L, 
10L), .Label = c("jan", "fev", "mar", "abr", "mai", "jun", "jul", 
"ago", "set", "out", "nov", "dez"), class = c("ordered", "factor"
))), row.names = c(NA, -50L), class = "data.frame", vars = c("USUARIO", 
"MÊS"), drop = TRUE))
  • @Jdmello, thanks for the suggestions. But could you use the dput data I put in? I managed to get the result, but maybe I can get confused for some other user with the same doubt that I.

1 answer

3


We can use data.table to do this operation:

# carregar data.table
library(data.table)

dadosDt <- as.data.table(dados) # transfomar dados em formato data.table

res <- data.table::dcast(dados, USUARIO ~ `MÊS`, fun.aggregate = function(x) 1, value.var = "compras_mês", fill = 0)

Which results in:

> res
   USUARIO jan ago set out nov dez
1       37   1   0   1   1   1   1
2       49   0   0   0   1   1   1
3      107   1   0   1   1   1   1
4      934   1   1   0   1   1   1
5     1116   1   0   0   1   1   1
6     1160   1   1   1   1   1   1
7     1302   0   0   1   1   1   1
8     1337   0   0   0   1   1   1
9     1384   1   1   1   1   1   1
10    1532   1   1   1   1   1   1
11    1551   0   1   1   1   0   0

Explaining the function data.table::dcast

The function data.table::dcast is very similar to package function tidyr::spread. You need to transform your data from "long" to "wide", i.e., "transpose" values from certain(s) column(s) into columns. This is possible through the argument formula in data.table::dcast (see ?dcast). The left side of the formula (in the example USUARIO) preserve the columns in the original format with the direct side (here, MES) being "transposed" to wide format. This allows the first step of transformation to occur. However, it is necessary to fill the values in the created columns with the values of MES. In that case, the argument value.var allows you to choose which variable will fill the cells in the columns created in MÊS. In this case, you want to fill out 1 if the user has made one or more purchases in a given month/year. That is why we use compra_mês in value.var. Finally, we use a "hack" fun.aggregate for cells to be filled with 1 if there is one or more events in compras_mês. If this argument is omitted, data.table::dcast count the values on the right side of the formula:

When variable Combinations in formula doesn’t identify a Unique value in a Cell, fun.aggregate will have to be specified, which defaults to length if unspecified.

That’s why you have to be specific fun.aggregate, otherwise data.table::dcast would return how many occurrences they had for each month and user.

  • Hello! I have that result. Maybe I didn’t know how to express myself in the question, but my difficulty is to have a row for each user and have one column per month. According to research I did, creating a categorical variable in the months solves, but I don’t know how I can do it.

  • @Izakmandrak I gave a corrected in my answer, is not that what you are looking for?

  • wouldn’t be that way. The months of my base is online and I would like to leave them in columns, maybe I’m asking the wrong way, I can redo the question differently if you need.

  • try this: res <- data.table::dcast(dados, USUARIO ~ MES, fun.aggregate = function(x) 1, value.var = "COMPRAS", fill = 0)

  • You want to transfer the column MES in multiple columns in dummy/binary format? That’s it?

  • That’s right. I don’t know much about the subject, I must rephrase the question?

  • The question is not wrong in the title, but this excerpt confused me: "I would like to know the amount of purchases USUARIO". My answer has been updated, I think it’s correct...

  • if you have suggestions and want to edit the question please feel free.

  • @Jdmello thanks for the suggestions. But could you use the dput data I put in? I managed to get the result, but maybe I can get confused for some other user with the same doubt as me. - Izak Mandrak

  • only do dados <- dput(...) and the result must be replicable.

  • Yeah. It worked out.

Show 6 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.