What is wide/long data?

Asked

Viewed 1,438 times

9

It is common to read/hear comments saying that the data is in format wide or long.

  1. Which means a table is in format wide? And long?

  2. Which packages/functions in R can be used to transform a table wide in long and vice versa?


There is a that question, but it is restricted to one of the cases and only its operationalization, without entering into conceptual questions.

2 answers

7


  1. Difference between wide and long

Wide format

In broad format (wide format), the responses of the same individuals will be in a single row and each response will be in a separate column.

For example, in the following dataset, each county was measured at four time points, once every 10 years, starting in 1970. The result variable is Jobs and indicates the number of jobs in each county. There are three predictive variables: Area of the Earth, Natural Amenity (4 = no and 3 = Yes) and the proportion of the county’s population that year College.

As the area of land and the presence of a natural amenity do not change from decade to decade, these predictors have only one variable per municipality. But both our result, Jobs, and a predictor, College, have different values in each year, so require a different variable (column) for each year.

Imgur

Long format

In long format (long format), Each line is one time point per subject. Therefore, each subject (county) will have data in several lines. All variables that do not change over time will have the same value on all lines.

You can see the same data from the five counties below in long format. Each county has four rows of data - one for each year.

All the same information is there; we just configure the data differently.

We don’t need more than four columns to Jobs or College. Instead, all four values of Jobs for each county are stacked. They are all in the column Jobs. The same goes for the four values of College.

But to track which observation occurred in which year, we need to add a variable Year.

You’ll notice that the variables that haven’t changed from year to year - Land Area and Natural Amenity - have the same value on each of the four lines in each county. It sounds strange, but it’s okay to have it that way and as long as you analyze the data using the right procedures, you’ll take into account that they’re redundant.

Imgur

Source

  1. Packets

    • tidyr - Function: gather (format wide for long) and spread (format long for wide)

    • reshape2 - Function: melt (format wide for long) and dcast (format long for wide)

    • groundwork - Function: reshape

  2. Additional materials: link1; Link2

5

What does it mean that a table is in wide format? And long?

  • A database in format wide is the one in which the variables are unpaired (one separate from the other).
  • A database in format long is the one in which the variables are stacked (one on top of the other).

The following figure exemplifies this:

inserir a descrição da imagem aqui

Note that in the format wide, x and y are the variable names. When you convert the database to the format long, the values of these variables will be stacked with the respective name of the variable next to it (i.e., in another column), indicating that these values correspond to it.

Which packages/functions in R can be used to turn a wide table into long and vice versa?

In the r utils you can use the function stack (stack) to convert a data.frame of wide for long:

dataset<-data.frame(matrix(runif(6*5,1,10),ncol=6)) # criação do data.frame
dataset

        X1       X2       X3       X4       X5       X6
1 7.349284 6.028351 2.688078 6.125223 1.221548 9.612955
2 4.069976 2.874686 6.672611 7.392773 8.788791 1.947049
3 5.601091 5.088117 6.642646 1.919682 5.083521 2.890271
4 8.972191 5.440744 1.900963 2.321034 4.617486 3.135706
5 6.863326 3.664501 8.406267 2.357013 7.787931 5.592315

empilhar<-stack(dataset,select=(1:6)) # converte para o formato long
empilhar
     values ind
1  7.349284  X1
2  4.069976  X1
3  5.601091  X1
4  8.972191  X1
5  6.863326  X1
6  6.028351  X2
7  2.874686  X2
8  5.088117  X2
9  5.440744  X2
10 3.664501  X2
11 2.688078  X3
12 6.672611  X3
13 6.642646  X3
14 1.900963  X3
15 8.406267  X3
16 6.125223  X4
17 7.392773  X4
18 1.919682  X4
19 2.321034  X4
20 2.357013  X4
21 1.221548  X5
22 8.788791  X5
23 5.083521  X5
24 4.617486  X5
25 7.787931  X5
26 9.612955  X6
27 1.947049  X6
28 2.890271  X6
29 3.135706  X6
30 5.592315  X6

The function unstack performs the reverse process. That is, converts from long for wide:

desempilhar<-unstack(empilhar) # converte para o formato wide

        X1       X2       X3       X4       X5       X6
1 7.349284 6.028351 2.688078 6.125223 1.221548 9.612955
2 4.069976 2.874686 6.672611 7.392773 8.788791 1.947049
3 5.601091 5.088117 6.642646 1.919682 5.083521 2.890271
4 8.972191 5.440744 1.900963 2.321034 4.617486 3.135706
5 6.863326 3.664501 8.406267 2.357013 7.787931 5.592315

In the tidyverse, the package tidyr It is also useful to do what you want. To stack, you can use the function gather. Considering the same data set:

library(tidyr)

res1<-gather(dataset, key='factor', value = 'my', 1:6)
head(res1,10)

   factor       my
1      X1 5.938725
2      X1 4.367486
3      X1 3.220609
4      X1 3.357561
5      X1 9.275956
6      X2 2.260197
7      X2 5.880264
8      X2 2.891555
9      X2 7.641574
10     X2 9.611466
  • where: key is the name of the vector that will bear the names of the stacked variables; value represents the variables to be stacked.

The reverse process of the function gather is given by the function spread (similar to unstack). More details on documentation.

There are other Packages who do the same (as reshape2).

Main data utilities in format long:

  • Automatically creates groups for variables (useful in some analyses, such as ANOVA). Moreover, in some functions of tidyverse, as ggplot::facet_wrap, the data must be in this format.

  • Prevents the execution of loops, when using the group variable as if they were columns of a data.frame.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.