Bootstrap in linear regression model - Calculating the importance of variables

I'm calculating variable importance for a multiple regression with the varImp function from the caret package. But when I wrap it in a function and bootstrap it with boot, I cannot recover the values the way I could for R².

How can I save the importance values of the coefficients to a .csv file, for example?

Reproducible example:

library(boot)
library(caret)

imp_lm <- function(data, indices) {
  d <- data[indices,] 

  fit.all <- lm(d$mpg~.,data=d)
  return(varImp(fit.all, scale = FALSE))
}

results <- boot(data=mtcars, statistic=imp_lm, R=10)

Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix

1 answer

Note that in the help page for boot, the statistic argument has the following description (emphasis added):

A function which when applied to data returns a vector containing the statistic(s) of interest.

When we run the varImp function on its own, we get the following:

fit.all <- lm(mpg ~ ., data=mtcars)
resultado <- varImp(fit.all, scale = FALSE)
is.data.frame(resultado)
## [1] TRUE
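To see why boot rejects this result, it can help to inspect its shape. The sketch below assumes the caret package is installed; imp and imp_vec are illustrative names that do not appear in the original post. For an lm fit, varImp returns a one-column data frame (the column is called Overall) with one row per predictor, whereas boot expects a plain numeric vector.

```r
library(caret)

fit.all <- lm(mpg ~ ., data = mtcars)
imp <- varImp(fit.all, scale = FALSE)   # 'imp' is an illustrative name

str(imp)        # a data frame with a single numeric column
names(imp)      # the column name
rownames(imp)   # the predictor names, in model order

# A plain numeric vector that keeps the predictor names
imp_vec <- setNames(imp[, 1], rownames(imp))
is.data.frame(imp_vec)   # no longer a data frame
```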

So your function imp_lm returns a data frame, because varImp returns a data frame, while boot expects a vector. One way around this is to end your function with return(varImp(fit.all, scale = FALSE)[, 1]), which extracts the first (and only) column of the result, the one holding the variable importance values. Note also that the formula is written as mpg ~ . rather than d$mpg ~ ., so that mpg is not also included among the predictors:

imp_lm <- function(data, indices) {
  d <- data[indices, ] 

  fit.all <- lm(mpg ~ ., data=d)
  return(varImp(fit.all, scale = FALSE)[, 1])
}

results <- boot(data=mtcars, statistic=imp_lm, R=10)
results

## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = imp_lm, R = 10)
## 
## 
## Bootstrap Statistics :
##       original     bias    std. error
## t1*  0.1066392  0.8799646   0.7188905
## t2*  0.7467585 -0.1206878   0.4692999
## t3*  0.9868407 -0.1654184   0.5865589
## t4*  0.4813036  0.5936594   0.7967201
## t5*  1.9611887 -0.8193792   0.5743548
## t6*  1.1234133  0.2350501   0.7057048
## t7*  0.1509915  0.9933979   0.8952965
## t8*  1.2254035  0.5388327   1.3083746
## t9*  0.4389142  0.6839702   0.8997836
## t10* 0.2406258  0.9177274   1.3346145
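As for saving the importance values to a .csv file, one possibility is to pull them out of the boot object: results$t0 holds the values computed on the original data and results$t holds one row per bootstrap replicate. This is only a sketch of one way to do it; the seed, the file names, the use of tempdir(), and the names var_names, summary_tab and replicates are illustrative assumptions, not part of the original answer.

```r
library(boot)
library(caret)

imp_lm <- function(data, indices) {
  d <- data[indices, ]
  fit.all <- lm(mpg ~ ., data = d)
  varImp(fit.all, scale = FALSE)[, 1]
}

set.seed(123)  # arbitrary seed, only so that the run can be repeated
results <- boot(data = mtcars, statistic = imp_lm, R = 10)

# Predictor names in the order in which varImp reports them
var_names <- rownames(varImp(lm(mpg ~ ., data = mtcars), scale = FALSE))

# Same three columns that print(results) shows, plus the variable name
summary_tab <- data.frame(
  variable  = var_names,
  original  = results$t0,
  bias      = colMeans(results$t) - results$t0,
  std_error = apply(results$t, 2, sd)
)
out_file <- file.path(tempdir(), "varimp_boot_summary.csv")
write.csv(summary_tab, out_file, row.names = FALSE)

# Optionally keep every replicate: one row per resample, one column per variable
replicates <- as.data.frame(results$t)
names(replicates) <- var_names
rep_file <- file.path(tempdir(), "varimp_boot_replicates.csv")
write.csv(replicates, rep_file, row.names = FALSE)
```

Replacing tempdir() with a folder of your choice writes the files where you can find them later.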
  • It worked very well for the example. I'm assuming that t1 is cyl, t2 is disp, t3 is hp, and so on, following the column order of the data frame (mtcars); is that correct? When I apply it to my own dataset, it only works with fewer than 30 variables; with more than that, I keep getting the same error.

  • "I'm assuming that t1 is cyl, t2 is disp, t3 is hp, and so on, following the column order of the data frame (mtcars); is that correct?" Yes, this is correct. As for the other question, it is outside the scope of the original question. As far as I can see, the problem posed with the example dataset was solved correctly.
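The thread does not establish why the error comes back with more variables. One possible cause, and this is only an assumption, is that some resamples produce a rank-deficient fit, so that the statistic returns fewer values than it did on the original data, while boot needs the same length every time. The sketch below avoids that by always returning one named slot per predictor, with NA where a coefficient could not be estimated. It swaps varImp for a base R computation of the absolute t-statistic, which is what caret documents as the importance measure for linear models; imp_fixed is a hypothetical name.

```r
library(boot)

# Hypothetical replacement for imp_lm: fixed-length, named output
imp_fixed <- function(data, indices) {
  d <- data[indices, ]
  fit <- lm(mpg ~ ., data = d)

  # coef() keeps aliased coefficients as NA, so its names give the full set
  all_names <- names(coef(fit))[-1]            # drop the intercept
  out <- setNames(rep(NA_real_, length(all_names)), all_names)

  # summary() drops aliased coefficients, so fill in only what it reports
  ctab <- summary(fit)$coefficients
  keep <- intersect(all_names, rownames(ctab))
  out[keep] <- abs(ctab[keep, "t value"])
  out
}

set.seed(123)  # arbitrary seed
results_fixed <- boot(data = mtcars, statistic = imp_fixed, R = 10)

# On the original data the names also confirm the t1, t2, ... ordering
imp_fixed(mtcars, seq_len(nrow(mtcars)))
```

Columns of results_fixed$t that contain NA would then point to the variables that were dropped in some resamples.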
