Equivalent to kmeans inside caret::train

I tried to fit a kmeans model with the train function from the caret package, but found that it is not available. I generated a data frame to try this:

set.seed(15)

d <- data.frame(
  x = replicate(6, rnorm(10000, 1000, 125))
)

cluster <- kmeans(d, centers = 3)
cluster

d$grupo <- as.factor(cluster[['cluster']])

library(recipes)
library(caret)

r <- recipe(grupo ~ ., data = d)
p <- prep(r, d)
b <- bake(p, d)

t <- train(
  r, 
  d, 
  method = 'kmeans', 
  trControl = trainControl(method = 'cv', number = 3)
)

Error: Model kmeans is not in Caret’s built-in library

  • Is there any function equivalent to kmeans so that I can validate the model?

I fitted an lda model the same way and it worked, but I would like some guidance regarding kmeans.

1 answer

caret is short for Classification And REgression Training. By definition, it is a package that provides algorithms for classification and regression.

Classification is what we call a method capable of separating our observations according to predefined classes. This is called supervised learning. Several classification methods are available in caret, such as LDA, random forest, k-nearest neighbors and the like. The complete list of these methods is at this link.
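To check whether a given method code is available, one option is to look it up among the models registered in caret. A small sketch, assuming the caret package is installed:

```r
library(caret)

# Codes of every model that train() knows about
available <- names(getModelInfo())

# lda is in the list, kmeans is not
"lda" %in% available
"kmeans" %in% available
```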

Clustering is what we call a method capable of separating our observations without the need to use predefined classes. It’s called unsupervised learning.

K-means is a clustering method. Therefore, it is not available in caret, and probably never will be.

So the answer to the question

Is there a function equivalent to kmeans so I can validate the model?

is no: there is no function equivalent to kmeans in caret. It is a package for classification, not clustering.

However, it is possible to use k-means as a classifier. As far as I know, there is no ready-made option for this in R, but nothing stops you from programming your own. I wouldn’t recommend it, because k-means has serious problems, such as:

  1. It does not work well on data with many dimensions

  2. It does not work well if the groups have very different sizes

  3. Because it uses Euclidean distance to decide which group each observation belongs to, it does not work well for data with strong asymmetry or many outliers
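If you still want to try it, here is a minimal sketch of what "programming your own" could look like: fit kmeans on a training subset, then assign each new observation to the nearest centroid. The helper predict_kmeans is a hypothetical name, not part of base R or caret, and the 200/100 split and the smaller sample size are only assumptions for illustration.

```r
# Hypothetical helper (not part of any package): assign each row of
# `newdata` to the cluster whose centroid is closest in Euclidean distance.
predict_kmeans <- function(fit, newdata) {
  centers <- fit$centers
  x <- as.matrix(newdata[, colnames(centers), drop = FALSE])
  # Squared distance from every row to every centroid
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  # Force an n x k matrix, even when newdata has a single row
  d2 <- matrix(d2, nrow = nrow(x))
  # Index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

set.seed(15)
d <- data.frame(x = replicate(6, rnorm(300, 1000, 125)))

# Assumed split: 200 rows to fit the centroids, 100 held out
train_idx <- sample(nrow(d), 200)
fit <- kmeans(d[train_idx, ], centers = 3, iter.max = 100)

pred_train <- predict_kmeans(fit, d[train_idx, ])
pred_test <- predict_kmeans(fit, d[-train_idx, ])
table(pred_test)
```

On the training rows, the nearest-centroid assignment should agree with fit$cluster once kmeans has converged, which is a quick sanity check for the helper.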

On the other hand, it seems to me that your problem is really one of classification, because you have access to the class of each observation. Therefore, any classification method in caret would serve to train and validate your model. If I understand correctly and your problem is classification and not clustering, I suggest you give up k-means and go for something more sophisticated.
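For instance, a minimal sketch of cross-validating a classifier on labels built the same way as in the question, assuming the caret and MASS packages are installed. It uses lda, which the question reports as working, with the formula interface instead of a recipe; the smaller sample size is only there to keep the run short.

```r
library(caret)

set.seed(15)
d <- data.frame(x = replicate(6, rnorm(300, 1000, 125)))

# Labels produced by kmeans, as in the question
d$grupo <- as.factor(kmeans(d, centers = 3)$cluster)

# 3-fold cross-validation of a supervised method on those labels
fit <- train(
  grupo ~ .,
  data = d,
  method = "lda",
  trControl = trainControl(method = "cv", number = 3)
)

# Resampled accuracy and kappa
fit$results
```

Any other classification method code from the caret list can be swapped in for "lda" in the same call.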
