How to solve the 53 categories limit of R randomForest?

Question

Asked 6 years, 8 months ago

In R, using the randomForest package, when I run randomForest() I get the following error message:

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

The factor in question has 57 categories. How can I change this limit or get around this problem?

1 answer

by Marcus Nunes • **17,915** points · Answer 1 · 2018-11-05T16:54:52+00:00

First, ask yourself whether you really need a categorical variable with this many levels. To split a nominal factor with n levels, random forest considers 2^n - 2 possible partitions of the levels in order to choose the best split. With n = 53, that is about 9.00719925e15 candidate splits, and a 57-level factor has roughly 16 times as many.

Even if your computer took only 0.001 seconds to evaluate each of these splits, the model would take about 285,616 years to fit. That is roughly as long as humans have existed as a species.
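
The arithmetic behind these figures can be reproduced with a few lines of base R. This is only a sketch: the helper names are made up for illustration, the 2^n - 2 count is the one quoted in this answer, and the 0.001 seconds per split is an assumed cost, not a measurement.

```r
# Number of candidate splits for a nominal factor, using the 2^n - 2 count
# quoted in this answer.
n_splits <- function(n_levels) 2^n_levels - 2

# Assumed cost of evaluating one split, in seconds (illustrative figure only).
seconds_per_split <- 0.001
seconds_per_year  <- 60 * 60 * 24 * 365

# Years needed to evaluate every candidate split once.
years_needed <- function(n_levels) {
  n_splits(n_levels) * seconds_per_split / seconds_per_year
}

splits_53 <- n_splits(53)
years_53  <- years_needed(53)
years_57  <- years_needed(57)

round(years_53)        # 285616
years_57 / years_53    # each extra level doubles the work, so 4 levels ~ 16x
```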

So I would start by asking why this variable has so many levels.

Is it a numerical variable that was read incorrectly? If so, treat it as numbers and not as categories.
If the variable is categorical, can it be treated as ordinal? If so, random forest can handle it much faster than a nominal variable, because an ordered factor with n levels has only n - 1 candidate split points.
If the variable is nominal, can it be simplified into fewer categories? For example, if the levels are countries of the world, you could create a new variable called continent with only 6 levels.
If it cannot be grouped that way, are all levels well represented? You could combine the low-frequency levels into a new level called Other.
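
The four options above could look roughly like this in base R. All data here is made up for illustration, and `lump_rare()` and the `continent_of` lookup are hypothetical helpers, not part of randomForest.

```r
# 1. Numbers read as a factor: convert via character, because as.numeric()
#    applied directly to a factor returns the internal level codes.
f_num <- factor(c("10", "2.5", "300", "10"))
x_num <- as.numeric(as.character(f_num))

# 2. Categories with a natural order: declare the factor as ordered.
size     <- c("small", "large", "medium", "small")
size_ord <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

# 3. Nominal levels that map onto a coarser grouping
#    (hypothetical lookup table, deliberately incomplete).
continent_of <- c(Brazil = "South America", Argentina = "South America",
                  France = "Europe", Japan = "Asia")
country   <- factor(c("Brazil", "France", "Japan", "Argentina", "Brazil"))
continent <- factor(unname(continent_of[as.character(country)]))

# 4. Rare levels merged into a single "Other" level.
lump_rare <- function(f, min_count, other = "Other") {
  f      <- as.factor(f)
  counts <- table(f)
  rare   <- names(counts)[counts < min_count]
  out    <- as.character(f)
  out[out %in% rare] <- other
  factor(out)
}

# Toy factor with 57 levels: 5 common ones and 52 that occur only once.
many   <- factor(c(rep(paste0("common_", 1:5), each = 20),
                   paste0("rare_", 1:52)))
lumped <- lump_rare(many, min_count = 5)

nlevels(many)     # 57
nlevels(lumped)   # 6
```

The `min_count` threshold is a judgment call that depends on the data. A threshold that is too high throws away levels that might carry signal.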

These are the usual approaches I know of for a problem like this. It will not be possible to fit this model without turning this variable into something simpler.
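
Before fitting, it may help to check which columns would trigger the error. The sketch below uses a hypothetical helper, `too_many_levels()`, with the limit of 53 taken from the error message. It assumes that only unordered factors count toward the limit.

```r
# Limit quoted in the error message from the question.
max_levels <- 53

# Return the level counts of the unordered factor columns that exceed the limit.
# Assumption: ordered factors and non-factor columns are not subject to it.
too_many_levels <- function(data, limit = max_levels) {
  n_lev <- vapply(data, function(col) {
    if (is.factor(col) && !is.ordered(col)) nlevels(col) else 0L
  }, integer(1))
  n_lev[n_lev > limit]
}

# Made-up example data: one factor with 60 levels, one with 3.
toy <- data.frame(
  id_like = factor(paste0("id_", 1:60)),
  group   = factor(rep(c("a", "b", "c"), 20)),
  value   = seq_len(60)
)

offenders <- too_many_levels(toy)
names(offenders)   # "id_like"
```

If a factor only exceeds the limit because it still carries unused levels after subsetting, calling `droplevels()` on the data may be enough.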