How to calculate the median of a line in a date.frame in R?

Asked

Viewed 1,523 times

6

I have a database and my goal is to perform some behavior analysis of classes per line.

Example:

print(DADOS)

 Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0

Using the example above, to analyze the classes A, B, C, D and E, at first I used the Sum and Mean of each line. Being the best way or not, to calculate the Sum and the Average I used the function mutate as follows:

DADOS = DADOS %>%
select(Linha:E) %>%
mutate(Soma = (A+B+C+D+E)) %>%
mutate(Média = Soma/5)

And my database went like this:

print(DADOS)

    Linha  A   B   C   D   E   Soma  Média
    L1     4   3   2   2   4   15    3
    L2     1   11  1   1   1   15    3
    L3     0   1   2   3   4   10    2
    L4     2   0   0   8   0   10    2

In the example above we can verify that although the lines have different data, the lines L1 and L2 possess the Soma and the Média the same applies to the Soma and the Média of the lines L3 and L4.

As the Soma and the Média were not so effective, we can include another calculation for the analysis: the Median.

How the Median Works?

In practice, Mediana sorts a data set and identifies its central element.

Example:

L4 = {2, 0, 0, 8, 0}

Ordering L4 = {0,0,0,2,8}

Median L4 = 0

Mediana divides a data set into equal parts in order to find a more assertive distribution trend. The use of Mediana is ideal to identify values in your set of data that escape the standard, the famous "non-standard".

Knowing this, the question is: How can I calculate the median of a line in a date frame.?

  • 1

    See the function median(). ?median

  • 1

    I know the function median, but I don’t know how to use the function to calculate the values in the line.

7 answers

9


A solution may be the following.

library(dplyr)

DADOS %>%
  rowwise() %>%
  mutate(Soma = (A + B + C + D + E),
         Média = Soma/5,
         Mediana = median(c(A, B, C, D, E)))
#Source: local data frame [4 x 9]
#Groups: <by row>
#
## A tibble: 4 x 9
#  Linha     A     B     C     D     E  Soma Média Mediana
#  <fct> <int> <int> <int> <int> <int> <int> <dbl>   <int>
#1 L1        4     3     2     2     4    15     3       3
#2 L2        1    11     1     1     1    15     3       1
#3 L3        0     1     2     3     4    10     2       2
#4 L4        2     0     0     8     0    10     2       0

Dice.

DADOS <- read.table(text = "
Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0                    
", header = TRUE)
  • Thanks for the reply @Rui Barradas, I will test.

  • It worked perfectly! Great solution!

6

A little of the explanation is here.

Reproducing the data

library(tidyverse)
txt <- "Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0"

DADOS <- as_tibble(
  read.table(text = txt, header = TRUE)
)

And then just use the transpose() to apply the median() for each line.

DADOS %>% 
  mutate(
    linhas = DADOS %>% select(-1) %>% 
      transpose() %>% map(unlist),
    mediana = map_dbl(linhas, median)
  ) %>% 
select(-linhas) 

# A tibble: 4 x 7
  Linha     A     B     C     D     E mediana
  <fct> <int> <int> <int> <int> <int>   <dbl>
1 L1        4     3     2     2     4       3
2 L2        1    11     1     1     1       1
3 L3        0     1     2     3     4       2
4 L4        2     0     0     8     0       0
  • Thank you for the reply @Tomás Barcellos, I will test.

5

An alternative that has emerged recently is to use the package rap:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
library(rap)

txt <- "Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0"

DADOS <- as_tibble(
  read.table(text = txt, header = TRUE)
)

DADOS %>% 
  rap(
    soma = numeric() ~ A + B + C + D + E,
    media = numeric() ~ mean(c(A, B, C, D, E)),
    mediana =  numeric() ~ median(c(A, B, C, D, E))
    )
#> # A tibble: 4 x 9
#>   Linha     A     B     C     D     E  soma media mediana
#>   <fct> <int> <int> <int> <int> <int> <dbl> <dbl>   <dbl>
#> 1 L1        4     3     2     2     4    15     3       3
#> 2 L2        1    11     1     1     1    15     3       1
#> 3 L3        0     1     2     3     4    10     2       2
#> 4 L4        2     0     0     8     0    10     2       0

Created on 2019-02-19 by the reprex package (v0.2.1)

The advantage of rap in relation to the rowise is that it allows you to specify the type of output, which can help you avoid some errors.

There is a repository with enough legal information about Row Oriented Workflows here.

In your case I wouldn’t rule out the gather + group_by also, although this does not put the columns exactly in the order you need.

DADOS %>% 
  gather(k, v, A:E) %>% 
  group_by(Linha) %>% 
  summarise(
    soma = sum(v),
    media = mean(v),
    mediana = mean(v)
  ) %>% 
  right_join(DADOS, by = "Linha")
  • Thank you for the reply @Daniel Falbel, I will test.

3

The fastest method is using data.table together with the package matrixStats (bundle rap is still under development so I won’t use it but I suspect it’s very efficient).

For comparative purposes, I will add 200k lines to DADOS:

# carregar pacote
library(data.table)

DADOS <- read.table(text = "
Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0                    
", header = TRUE)

# adicionar linhas
set.seed(1)
m <- data.table(Linha = NA, 
                matrix(data= round(runif(n=1e6,1,15)), 
            ncol = 5))

data.table::setnames(m, paste0("V", 1:5), LETTERS[1:5])


# novo DADOS, com 200k+4 linhas
DADOS <- data.table::rbindlist(list(DADOS, m))

# nomeie colunas para estatisticas descritivas para facilitar remoção
cols <- c("Soma", "Media", "Mediana")

The first method is the fastest, using data.table with the matrixStats:

system.time(
  DADOS[, `:=`(Soma = matrixStats::rowSums2(as.matrix(.SD), na.rm=T),
            Media = matrixStats::rowMeans2(as.matrix(.SD), na.rm=T),
            Mediana = matrixStats::rowMedians(as.matrix(.SD), na.rm=T)),
        .SDcols=LETTERS[1:5]]
)

Upshot:

  user  system elapsed 
   0.08    0.03    0.11 
> head(DADOS)
   Linha A  B C  D E Soma Media Mediana
1:    L1 4  3 2  2 4   15   3.0       3
2:    L2 1 11 1  1 1   15   3.0       1
3:    L3 0  1 2  3 4   10   2.0       2
4:    L4 2  0 0  8 0   10   2.0       0
5:  <NA> 5 12 7 11 5   40   8.0       7
6:  <NA> 6  8 3  8 7   32   6.4       7

The second method is using apeans data.table and added by=1:nrow(DADOS):

DADOS[, (cols) := NULL] # delete as colunas descritivas
# solucao apenas com data.table - ineficiente
system.time(
  DADOS[, `:=`(Soma = sum(.SD, na.rm=T),
               Media = mean(as.numeric(.SD), na.rm=T),
               Mediana = median(as.numeric(.SD), na.rm=T)), 
        .SDcols=LETTERS[1:5], by=1:nrow(DADOS)]
)

Upshot:

   user  system elapsed 
  65.71    0.26   67.78 
> head(DADOS)
   Linha A  B C  D E Soma Media Mediana
1:    L1 4  3 2  2 4   15   3.0       3
2:    L2 1 11 1  1 1   15   3.0       1
3:    L3 0  1 2  3 4   10   2.0       2
4:    L4 2  0 0  8 0   10   2.0       0
5:  <NA> 5 12 7 11 5   40   8.0       7
6:  <NA> 6  8 3  8 7   32   6.4       7

Using dplyr:

DADOS[, (cols) := NULL] # delete as colunas descritivas
library(dplyr)
system.time(
 DADOS <-  DADOS %>%
    rowwise() %>%
    mutate(Soma = (A + B + C + D + E),
           Média = Soma/5,
           Mediana = median(c(A, B, C, D, E))) %>% 
   ungroup()
)

Upshot:

   user  system elapsed 
  69.95    0.17   72.57 
> head(DADOS)
# A tibble: 6 x 9
  Linha     A     B     C     D     E  Soma Média Mediana
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 L1        4     3     2     2     4    15   3         3
2 L2        1    11     1     1     1    15   3         1
3 L3        0     1     2     3     4    10   2         2
4 L4        2     0     0     8     0    10   2         0
5 NA        5    12     7    11     5    40   8         7
6 NA        6     8     3     8     7    32   6.4       7
  • Thanks for the reply @Jdemello, I will test.

3

Applying functions in lines is a very boring thing to do with dplyr. I prefer to transpose the data frame, solve the problem in the columns and then arrange the result in a new object:

resultado <- DADOS %>% 
  select(-LINHA) %>% # tirando LINHA pra poder aplicar as funcoes que interessam
  t() %>% # data frame transposto
  as.data.frame() %>% # precisa converter em data frame pra funcionar
  summarise_all(c(sum, mean, median)) %>% # aplicando as tres funcoes simultaneamente
  matrix(., ncol=3) # oraganizando o resultado como ele deve ficar

DADOS <- cbind(DADOS, resultado)

names(DADOS) <- c(names(DADOS)[1:6], "Soma", "Média", "Mediana")
DADOS
##   LINHA A  B C D E Soma Média Mediana
## 1    L1 4  3 2 2 4   15     3       3
## 2    L2 1 11 1 1 1   15     3       1
## 3    L3 0  1 2 3 4   10     2       2
## 4    L4 2  0 0 8 0   10     2       0
  • Thanks for the reply @Marcus Nunes, I will test.

1

To apply functions in rows/columns, you can use the function apply of the R base.

In your example, to create a new column with the median of the lines, and to add to your data frame the function cbind:

cbind(DADOS, mediana=apply(DADOS[,-1], 1, median))

Explaining the command:

The first argument of cbind: DADOS is your original dataframe that will receive the new column.
The second argument of cbind mediana=apply(DADOS[,-1], 1, median) is the new column named median.

In apply the first argument is the dataframe. Note that I have deleted the first column, since this is not a numeric variable and should not enter for the calculation of the median (or any other statistic you want);
The second arumento is the 1 which means to apply the function of the third argument on the lines. If the value was used 2, the function would be applied to the columns.
The third argument is the function you want to apply. You can use ready-made or custom functions, for example if you wanted to calculate the average plus 2, use apply(mediana2=apply(DADOS[,-1], 1, function(x) median(x)+2

      Linha  A  B  C  D  E  mediana
1     L1     4  3  2  2  4       3
2     L2     1 11  1  1  1       1
3     L3     0  1  2  3  4       2
4     L4     2  0  0  8  0       0
  • Thanks for the reply @Lucas Ferreira, I will test.

1

Using functions of the package purrr you can do analysis byrow in a simpler way. Also, knowing that the median can only be calculated in numbers, you can analyze the vectors by their classes. Taking the example of @Rui Barradas:

dados <- read.table(text = "
Linha  A   B   C   D   E
 L1     4   3   2   2   4
 L2     1   11  1   1   1
 L3     0   1   2   3   4
 L4     2   0   0   8   0                    
", header = TRUE)

The analysis goes like this:

library(tidyverse)

dados %>% 
  mutate(var = pmap(.l = Filter(is.numeric, .), 
                    .f = lift_vd(..f = median)))

  Linha A  B C D E var
1    L1 4  3 2 2 4   3
2    L2 1 11 1 1 1   1
3    L3 0  1 2 3 4   2
4    L4 2  0 0 8 0   0

Easier.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.