How to create a categorical variable in R?

Question

How to create a categorical variable in R?

Asked 6 years, 6 months ago

Viewed 385 times

-2

I have a base with columns: USUARIO, MÊS, ANO and compra_mês. I would like to create variables dummy based on compras_mês for USUARIO, every month of the year.

   USUARIO MÊS    ANO      compras_mês
     <int> <ord> <dbl>       <int>
 1      37 jan    2019           1
 2      37 set    2018           1
 3      37 out    2018           4
 4      37 nov    2018           3
 5      37 dez    2018           3
 6      49 out    2018           1
 7      49 nov    2018           1
 8      49 dez    2018           1
 9     107 jan    2019           2
10     107 set    2018           1

As an example, my goal would be to leave the database like this:

USUARIO  set  out  nov  dez

 37      1    1    1    1
 49      1    0    1    0
 107     0    0    1    1

How can I do that? The command dummy would be a solution?

Follow the dput to help with the question:

> dput(head(teste, 50))
structure(list(USUARIO = c(37L, 37L, 37L, 37L, 37L, 49L, 49L, 
49L, 107L, 107L, 107L, 107L, 107L, 934L, 934L, 934L, 934L, 934L, 
1116L, 1116L, 1116L, 1116L, 1160L, 1160L, 1160L, 1160L, 1160L, 
1160L, 1302L, 1302L, 1302L, 1302L, 1337L, 1337L, 1337L, 1384L, 
1384L, 1384L, 1384L, 1384L, 1384L, 1532L, 1532L, 1532L, 1532L, 
1532L, 1532L, 1551L, 1551L, 1551L), MÊS = structure(c(1L, 9L, 
10L, 11L, 12L, 10L, 11L, 12L, 1L, 9L, 10L, 11L, 12L, 1L, 8L, 
10L, 11L, 12L, 1L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 
9L, 10L, 11L, 12L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 8L, 9L, 10L), .Label = c("jan", "fev", 
"mar", "abr", "mai", "jun", "jul", "ago", "set", "out", "nov", 
"dez"), class = c("ordered", "factor")), ANO = c(2019, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2018, 
2019, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2019, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2019, 2018, 2018, 2018, 2018, 2018, 2019, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018), compras_mês = c(1L, 1L, 4L, 3L, 3L, 
1L, 1L, 1L, 2L, 1L, 5L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 2L, 4L, 
4L, 5L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 2L)), row.names = c(NA, 
-50L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("USUARIO", 
"MÊS"), drop = TRUE, indices = list(0L, 1L, 2L, 3L, 4L, 5L, 6L, 
    7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
    19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 
    31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 
    43L, 44L, 45L, 46L, 47L, 48L, 49L), group_sizes = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), biggest_group_size = 1L, labels = structure(list(USUARIO = c(37L, 
37L, 37L, 37L, 37L, 49L, 49L, 49L, 107L, 107L, 107L, 107L, 107L, 
934L, 934L, 934L, 934L, 934L, 1116L, 1116L, 1116L, 1116L, 1160L, 
1160L, 1160L, 1160L, 1160L, 1160L, 1302L, 1302L, 1302L, 1302L, 
1337L, 1337L, 1337L, 1384L, 1384L, 1384L, 1384L, 1384L, 1384L, 
1532L, 1532L, 1532L, 1532L, 1532L, 1532L, 1551L, 1551L, 1551L
), MÊS = structure(c(1L, 9L, 10L, 11L, 12L, 10L, 11L, 12L, 1L, 
9L, 10L, 11L, 12L, 1L, 8L, 10L, 11L, 12L, 1L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 9L, 10L, 11L, 12L, 10L, 11L, 12L, 
1L, 8L, 9L, 10L, 11L, 12L, 1L, 8L, 9L, 10L, 11L, 12L, 8L, 9L, 
10L), .Label = c("jan", "fev", "mar", "abr", "mai", "jun", "jul", 
"ago", "set", "out", "nov", "dez"), class = c("ordered", "factor"
))), row.names = c(NA, -50L), class = "data.frame", vars = c("USUARIO", 
"MÊS"), drop = TRUE))

@Jdmello, thanks for the suggestions. But could you use the dput data I put in? I managed to get the result, but maybe I can get confused for some other user with the same doubt that I.

– Izak Mandrak

2019/02/04 at 18:51

1 answer

Browser other questions tagged r

You are not signed in. Login or sign up in order to post.

by JdeMello • **844** points · Answer 1 · 2019-02-04T17:58:46+00:00

We can use data.table to do this operation:

# carregar data.table
library(data.table)

dadosDt <- as.data.table(dados) # transfomar dados em formato data.table

res <- data.table::dcast(dados, USUARIO ~ `MÊS`, fun.aggregate = function(x) 1, value.var = "compras_mês", fill = 0)

Which results in:

> res
   USUARIO jan ago set out nov dez
1       37   1   0   1   1   1   1
2       49   0   0   0   1   1   1
3      107   1   0   1   1   1   1
4      934   1   1   0   1   1   1
5     1116   1   0   0   1   1   1
6     1160   1   1   1   1   1   1
7     1302   0   0   1   1   1   1
8     1337   0   0   0   1   1   1
9     1384   1   1   1   1   1   1
10    1532   1   1   1   1   1   1
11    1551   0   1   1   1   0   0

Explaining the function `data.table::dcast`

The function data.table::dcast is very similar to package function tidyr::spread. You need to transform your data from "long" to "wide", i.e., "transpose" values from certain(s) column(s) into columns. This is possible through the argument formula in data.table::dcast (see ?dcast). The left side of the formula (in the example USUARIO) preserve the columns in the original format with the direct side (here, MES) being "transposed" to wide format. This allows the first step of transformation to occur. However, it is necessary to fill the values in the created columns with the values of MES. In that case, the argument value.var allows you to choose which variable will fill the cells in the columns created in MÊS. In this case, you want to fill out 1 if the user has made one or more purchases in a given month/year. That is why we use compra_mês in value.var. Finally, we use a "hack" fun.aggregate for cells to be filled with 1 if there is one or more events in compras_mês. If this argument is omitted, data.table::dcast count the values on the right side of the formula:

When variable Combinations in formula doesn’t identify a Unique value in a Cell, fun.aggregate will have to be specified, which defaults to length if unspecified.

That’s why you have to be specific fun.aggregate, otherwise data.table::dcast would return how many occurrences they had for each month and user.

How to create a categorical variable in R?

1 answer

Explaining the function data.table::dcast

Explaining the function `data.table::dcast`