Create a column with the second highest row value

Asked

Viewed 88 times

4

I would like to create a column on dataframe teste with the second highest value of the respective line, as I did to find Min and Max.

library("matrixStats")
teste <- read.table("https://raw.githack.com/fsbmat/StackOverflow/master/teste.txt",header = TRUE)
teste$ML_anual <- round(rowSums(teste[,c("LIQ_Jan2013","LIQ_Fev2013","LIQ_Mar2013","LIQ_Abr2013","LIQ_Mai2013","LIQ_Jun2013","LIQ_Jul2013","LIQ_Ago2013","LIQ_Set2013","LIQ_Out2013","LIQ_Nov2013","LIQ_Dez2013")],na.rm = T)/12,digits=2)
teste$Min <- round(rowMins(as.matrix(teste[,c("LIQ_Jan2013","LIQ_Fev2013","LIQ_Mar2013","LIQ_Abr2013","LIQ_Mai2013","LIQ_Jun2013","LIQ_Jul2013","LIQ_Ago2013","LIQ_Set2013","LIQ_Out2013","LIQ_Nov2013","LIQ_Dez2013")]),na.rm = T),digits=2)
teste$Max <- round(rowMaxs(as.matrix(teste[,c("LIQ_Jan2013","LIQ_Fev2013","LIQ_Mar2013","LIQ_Abr2013","LIQ_Mai2013","LIQ_Jun2013","LIQ_Jul2013","LIQ_Ago2013","LIQ_Set2013","LIQ_Out2013","LIQ_Nov2013","LIQ_Dez2013")]),na.rm = T),digits=2)

The original database has 130,000 lines, so I’d like to find a function that doesn’t require a loop for to speed up the process!

2 answers

4

Here are two functions that compute the second largest element in each row of a table. The second function uses the knowledge of which is the maximum value on each line to calculate the 2nd largest.

SegundoMaiorLinha <- function(DF){
  m2 <- rep(-Inf, nrow(DF))
  for(i in seq_len(nrow(DF))){
    M <- max(DF[i, ], na.rm = TRUE)
    for(j in seq_len(ncol(DF))){
      if(DF[i, j] > m2[i] && DF[i, j] != M) m2[i] <- DF[i, j]
    }
  }
  m2
}

SegundoMaiorLinha2 <- function(DF, Max){
  m2 <- rep(NA, nrow(DF))
  for(i in seq_len(nrow(DF))){
    DF[i, which(DF[i, ] == Max[i])] <- -Inf
    m2[i] <- max(DF[i, ], na.rm = TRUE)
  }
  m2
}

This vector cols is just to simplify the code. Then both functions are tested and the results are the same.

cols <- c("LIQ_Jan2013","LIQ_Fev2013","LIQ_Mar2013","LIQ_Abr2013",
          "LIQ_Mai2013","LIQ_Jun2013","LIQ_Jul2013","LIQ_Ago2013",
          "LIQ_Set2013","LIQ_Out2013","LIQ_Nov2013","LIQ_Dez2013")

SegundoMaiorLinha(teste[, cols])
#[1] 10509.62 13577.62  5628.19

SegundoMaiorLinha2(teste[, cols], teste$Max)
#[1] 10509.62 13577.62  5628.19

Performance test. The difference is not great but the second is faster.

library(microbenchmark)
library(ggplot2)

mb <- microbenchmark(
  fun = SegundoMaiorLinha(teste[, cols]),
  fun2 = SegundoMaiorLinha2(teste[, cols], teste$Max),
  times = 1e3
)

print(mb, order = "median")
autoplot(mb)

3


A faster and simpler way to do it is via apply:

cols <- c("LIQ_Jan2013","LIQ_Fev2013","LIQ_Mar2013","LIQ_Abr2013","LIQ_Mai2013","LIQ_Jun2013","LIQ_Jul2013","LIQ_Ago2013","LIQ_Set2013","LIQ_Out2013","LIQ_Nov2013","LIQ_Dez2013")
teste$seg.max <- apply(teste[,cols], 1, FUN = function(x){sort(x, decreasing = T)[2]})

Compared to the rest:

SegundoApply <- function(DF){
  apply(DF, 1, FUN = function(x){sort(x, decreasing = T)[2]})
}
mb <- microbenchmark(
  fun = SegundoMaiorLinha(teste[, cols]),
  fun2 = SegundoMaiorLinha2(teste[, cols], teste$Max),
  fun3 = SegundoApply(teste[, cols]),
  times = 1e3
)

> print(mb, order = "median")
Unit: microseconds
 expr      min        lq     mean    median        uq       max neval cld
 fun3  389.099  418.6315  477.543  440.2565  469.4095  4875.095  1000 a  
 fun2 1918.865 1991.0090 2189.838 2028.4195 2105.4355 11119.613  1000  b 
  fun 1998.850 2055.3400 2295.097 2088.6375 2170.4150  9949.591  1000   c

Browser other questions tagged

You are not signed in. Login or sign up in order to post.