Start by asking yourself whether you really need a categorical variable with this many levels. When splitting on a factor with n levels, Random Forest evaluates 2^n - 2 possible divisions of the variable to choose the best split point. In this case, that is about 9.00719925e15 candidate divisions.
If your computer takes 0.001 seconds to evaluate each of these divisions, it will take about 285,616 years to fit the model. That is roughly as long as humans have existed as a species on Earth.
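The arithmetic behind those two figures can be checked with a few lines of R. This is only a sketch: `n_levels <- 53` is my assumption, chosen because it reproduces the number quoted above, and the 0.001 seconds per division is the same hypothetical cost used in the text.

```r
# Assumption: 53 levels, which reproduces the figure quoted above.
n_levels <- 53

# Number of candidate divisions, using the 2^n - 2 count from the text.
n_splits <- 2^n_levels - 2

# Hypothetical cost of evaluating one division, in seconds.
seconds_per_split <- 0.001

# Convert the total time to years (365-day years).
seconds_per_year <- 365 * 24 * 3600
years <- n_splits * seconds_per_split / seconds_per_year

format(n_splits, scientific = TRUE, digits = 9)
floor(years)
```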
First of all, I would ask myself why this variable has so many levels.
Is it actually a numeric variable that was read in incorrectly? If so, treat it as numbers and not as categories.
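If that is the situation, a minimal sketch of the conversion looks like this. The vector `x` is made-up data for illustration. Note that calling `as.numeric()` directly on a factor returns the internal level codes, not the original values, so the factor has to go through `as.character()` first.

```r
# Made-up example: numbers that were read in as a factor.
x <- factor(c("10.5", "3", "27", "3"))

# Wrong: this returns the integer level codes, not the values.
codes <- as.numeric(x)

# Right: convert the labels to text, then to numbers.
x_num <- as.numeric(as.character(x))

x_num
```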
If the variable is categorical, is it possible to treat it as an ordinal variable? If so, Random Forest can handle ordinal variables faster than nominal ones.
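A sketch of declaring the order explicitly, assuming the levels really do have a natural order. The `size` values are hypothetical. As far as I know, the randomForest package treats ordered factors like numeric predictors, so only the n - 1 cut points between consecutive levels need to be considered.

```r
# Hypothetical data with a natural order.
size <- c("small", "large", "medium", "small")

# Declare the order explicitly instead of relying on alphabetical order.
size_ord <- factor(size,
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# An ordered variable with n levels has only n - 1 threshold splits.
n_ordered_splits <- nlevels(size_ord) - 1

as.integer(size_ord)
```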
If the variable is nominal, is it possible to simplify it into fewer categories? For example, if the levels are countries of the world, you could create a new variable called continent with only 6 levels.
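One way to sketch this grouping is with a named lookup vector. The countries and the `continent_of` table below are illustrative only; a real mapping would have to cover every level in your data.

```r
# Illustrative data: a nominal variable with one level per country.
country <- factor(c("Brazil", "Portugal", "Japan", "Argentina", "Kenya"))

# Hypothetical lookup table from country to continent.
continent_of <- c(Brazil    = "South America",
                  Portugal  = "Europe",
                  Japan     = "Asia",
                  Argentina = "South America",
                  Kenya     = "Africa")

# Index the lookup by the country label to build the coarser variable.
continent <- factor(unname(continent_of[as.character(country)]))

# Countries missing from the lookup would show up as NA here.
anyNA(continent)
nlevels(continent)
```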
Alternatively, are all levels well represented? Could the low-frequency levels be combined into a new level called Other?
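Combining rare levels can be sketched in base R as below. The helper `lump_rare()` and its `min_count` threshold are names I made up for illustration, and the right threshold depends on your data.

```r
# Hypothetical helper: merge levels seen fewer than min_count times.
lump_rare <- function(f, min_count = 2, other = "Other") {
  labels <- as.character(f)
  counts <- table(labels)
  rare <- names(counts)[counts < min_count]
  labels[labels %in% rare] <- other
  factor(labels)
}

# Made-up data: "c" and "d" each appear only once.
x <- c("a", "a", "a", "b", "b", "c", "d")
lumped <- lump_rare(x, min_count = 2)

table(lumped)
```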
These are the usual approaches I know of for a problem like this. It will not be possible to fit this model without turning this variable into something simpler.
I’m already working on a way to simplify it. The bigger problem is that in an lm(), R² increases by 0.35 when this variable is added. I was tempted to simply apply as.numeric() to this factor. – Márcio Mocellin