Complete observations in a data frame

5

Suppose I have the following data set:

dados <- structure(list(Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 
2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 
2005L), Categoria = structure(c(1L, 2L, 4L, 1L, 3L, 4L, 1L, 2L, 
3L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor"), Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 
20, 23, 20, 22, 25, 18, 21, 16)), .Names = c("Ano", "Categoria", 
"Valor"), row.names = c(NA, -16L), class = "data.frame")

dados
    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         d    15
4  2002         a    14
5  2002         c    18
6  2002         d    20
7  2003         a    17
8  2003         b    21
9  2003         c    20
10 2004         b    23
11 2004         c    20
12 2004         d    22
13 2005         a    25
14 2005         b    18
15 2005         c    21
16 2005         d    16

This data set has 3 columns: Ano (year), Categoria (category) and Valor (value). By definition, the years range from 2001 to 2005 and the only possible categories are a, b, c and d. However, some categories were not recorded in some years. For example, category c did not occur in 2001, category b did not occur in 2002, and so on.

I would like to create a new data frame with 5*4 = 20 rows, which is the number of years multiplied by the number of distinct categories. In other words, the new data frame should contain every possible combination of Ano and Categoria values. Each combination that does not exist in the original data frame must be assigned the value 0. I want the following result:

    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         c     0
4  2001         d    15
5  2002         a    14
6  2002         b     0
7  2002         c    18
8  2002         d    20
9  2003         a    17
10 2003         b    21
11 2003         c    20
12 2003         d     0
13 2004         a     0
14 2004         b    23
15 2004         c    20
16 2004         d    22
17 2005         a    25
18 2005         b    18
19 2005         c    21
20 2005         d    16

How can I do this in an automated way?

2 answers

7


With the tidyr package, you can use the complete() function:

library(tidyr)
complete(dados, Ano, Categoria, fill = list(Valor = 0))
# A tibble: 20 x 3
     Ano Categoria Valor
   <int>    <fctr> <dbl>
 1  2001         a    15
 2  2001         b    21
 3  2001         c     0
 4  2001         d    15
 5  2002         a    14
 6  2002         b     0
 7  2002         c    18
 8  2002         d    20
 9  2003         a    17
10  2003         b    21
11  2003         c    20
12  2003         d     0
13  2004         a     0
14  2004         b    23
15  2004         c    20
16  2004         d    22
17  2005         a    25
18  2005         b    18
19  2005         c    21
20  2005         d    16
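
One detail worth keeping in mind: complete() expands a factor column over all of its declared levels, but for an integer column such as Ano it only sees the values present in the data. The following is a minimal sketch, assuming a hypothetical subset called parcial in which the year 2004 is missing entirely, that passes the year range from the question explicitly. The names parcial and completo are illustrative only.

```r
library(tidyr)

# Same data as in the question, rebuilt compactly
dados <- data.frame(
  Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L,
          2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L),
  Categoria = factor(c("a", "b", "d", "a", "c", "d", "a", "b", "c",
                       "b", "c", "d", "a", "b", "c", "d"),
                     levels = c("a", "b", "c", "d")),
  Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 20, 23, 20, 22, 25, 18, 21, 16)
)

# Hypothetical subset: every 2004 row removed (assumption for illustration)
parcial <- dados[dados$Ano != 2004L, ]

# Ano is supplied explicitly so that 2004 is restored;
# Categoria is a factor, so all of its levels are used
completo <- complete(parcial, Ano = 2001:2005, Categoria,
                     fill = list(Valor = 0))
```

With the original dados, where every year appears at least once, the shorter call shown above should be enough.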

4

Using only base R, you can use expand.grid() followed by merge().

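# Every combination of the observed Ano and Categoria values
tmp <- expand.grid(Ano = unique(dados$Ano), Categoria = unique(dados$Categoria))

# Keep every row of the grid; combinations absent from dados get Valor = NA
res <- merge(dados, tmp, all.y = TRUE)
res$Valor[is.na(res$Valor)] <- 0
res

rm(tmp)  # final cleanup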
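
Since unique() returns only the values that actually occur, a year or category that never appears at all would be left out of the grid. The following is a minimal sketch, assuming the domain stated in the question (years 2001 to 2005, categories given by the factor levels), that builds the grid from that domain instead. The name grade is hypothetical.

```r
# Same data as in the question, rebuilt compactly
dados <- data.frame(
  Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L,
          2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L),
  Categoria = factor(c("a", "b", "d", "a", "c", "d", "a", "b", "c",
                       "b", "c", "d", "a", "b", "c", "d"),
                     levels = c("a", "b", "c", "d")),
  Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 20, 23, 20, 22, 25, 18, 21, 16)
)

# Grid built from the declared domain (assumption: 2001 to 2005, all levels)
niveis <- levels(dados$Categoria)
grade <- expand.grid(Ano = 2001:2005,
                     Categoria = factor(niveis, levels = niveis))

# Left join onto the grid, then replace the missing values with 0
res <- merge(grade, dados, all.x = TRUE)
res$Valor[is.na(res$Valor)] <- 0

# Sort explicitly by year and category, and reset the row names
res <- res[order(res$Ano, res$Categoria), ]
rownames(res) <- NULL
res
```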
