Improve performance for predictive model creation


I am creating a predictive model in R using the caret package. When I run it in R it takes a long time and still throws some errors. By comparison, running the same dataset in Weka gives me a result in just a few minutes.

I have already converted the variables to integer, but that did not help much.

I have also tried running it in parallel, but that did not help much either.

What does performance depend on in this case? Which factors most influence poor performance when building a predictive model?

  • caret does parameter tuning by default... are you sure that isn't the problem? It may be training 30 models instead of the single one you are expecting, unlike Weka. It also depends on which model you are building. For a random forest, for example, caret can use the randomForest package as well as ranger (and others), and one of them is always faster.

  • How does the tuning work? I managed to run with tuning = 10 and 20, but it didn't get any better. I started with kNN because it is faster, but I tried randomForest and ran into the same problem... a long delay and no result. On a sample of the dataset it does work. Watching the memory usage, I believe it is training several models, even though I set it up for just one?

  • We would need to see your code... You should set the tuneGrid argument. You also need to adjust trControl, because caret does cross-validation as well. How to do it is described here: https://topepo.github.io/caret/model-training-and-tuning.html#basic-Parameter-tuning If you add a minimal reproducible example to your question it will be easier to answer; as it stands, any answer would have to be very long.

  • That number parameter is the number of cross-validation folds. In other words, 10 means that for each element of the parameter grid you fit the model 10 times, evaluating the error on the held-out part of the data each time (a sketch follows these comments).

  • inTrain <- createDataPartition(y = make.names(rf$class), p = 0.7, list = FALSE)

    training <- rf[inTrain, ]
    teste <- rf[-inTrain, ]

    set.seed(234)
    train_control <- trainControl(method = "cv", number = 10)

    model <- train(as.factor(class) ~ .,
                   data = training,
                   trControl = train_control,
                   method = "rf")

    I am running this code on a table with roughly 50k rows and 60 columns.

  • Right, but when I run it in Weka I also do 10-fold CV.

  • After about an hour of running, it failed again. Error in train.default(x, y, weights = w, ...) : Stopping In addition: There were 50 or more warnings (use warnings() to see the first 50)

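Putting the comments together: with method = "rf" and no tuneGrid, caret tries 3 values of mtry by default, so 10-fold CV means 3 × 10 = 30 model fits, which matches the "30 models" mentioned above. A minimal sketch that pins a single combination, reusing rf and training from the code above (the mtry value here is an arbitrary example, roughly the square root of the 60 columns):

    library(caret)

    set.seed(234)
    train_control <- trainControl(method = "cv", number = 10)

    # One row in tuneGrid = one hyperparameter combination:
    # caret now fits 10 fold models + 1 final model instead of 30 + 1
    single_grid <- data.frame(mtry = 8)  # arbitrary example value

    model <- train(as.factor(class) ~ .,
                   data = training,
                   method = "rf",
                   trControl = train_control,
                   tuneGrid = single_grid)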

1 answer



There may be several reasons for the slowdown:

  • Slow algorithm. randomForest is not the fastest package: try ranger or Rborist instead (see the sketch after this list). xgboost is also extremely fast and has plenty of knobs to tune.
  • caret is tuning the hyperparameters. Pass just a single combination of hyperparameters via the tuneGrid argument.
  • It is quite likely that pure-R implementations perform worse than Weka's (written in Java), but you can drive Weka from R (look for the RWeka package; there is a sketch at the end of this answer).
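As an illustration of the first bullet, a minimal sketch that fits the forest directly with ranger, bypassing caret entirely; it assumes the training and teste data frames from the question, and num.trees = 500 is just an example value:

    library(ranger)

    # ranger runs multithreaded C++ and treats a factor
    # response as a classification problem
    training$class <- as.factor(training$class)
    teste$class <- as.factor(teste$class)

    set.seed(234)
    fit <- ranger(class ~ ., data = training, num.trees = 500)

    # Confusion matrix on the held-out part
    pred <- predict(fit, data = teste)
    table(predicted = pred$predictions, actual = teste$class)

The same package can also be used through caret with method = "ranger", keeping the trainControl and tuneGrid ideas from the comments above.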

It is hard to say why the errors happen without seeing your data. My guess is that one of your variables has a rare class, and when you cross-validate, some of the folds end up without it (the quick check below shows how to test this).
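A minimal way to check that guess, assuming the rf data frame from the question: any level with fewer observations than the number of folds (10 here) can end up missing from a fold entirely.

    # Class counts for the outcome
    table(rf$class)

    # The same check for every factor predictor: keep only
    # levels rarer than the 10 folds
    rare <- lapply(Filter(is.factor, rf), function(x) {
      counts <- table(x)
      counts[counts < 10]
    })
    Filter(length, rare)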

Always try to find an R package that uses a C/C++ implementation to train the models. For machine learning, R is best treated as an interface: a convenient, reasonably standardized way of using algorithms from many different sources.
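To illustrate both the interface point and the RWeka suggestion above, a minimal sketch that drives Weka's Java random forest from R; it assumes a Java runtime and the RWeka package are installed, and reuses the training data frame from the question:

    library(RWeka)

    # Wrap Weka's Java RandomForest class as an R function
    WekaRF <- make_Weka_classifier("weka/classifiers/trees/RandomForest")

    fit <- WekaRF(as.factor(class) ~ ., data = training)

    # 10-fold cross-validation performed on the Weka side
    evaluate_Weka_classifier(fit, numFolds = 10)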
