Equivalence to kmeans inside Caret::Train

Question

Asked 6 years, 4 months ago

I tried to fit a k-means model with the train function from the caret package, but found that it is not available. I generated a data frame to try this:

set.seed(15)

d <- data.frame(
  x = replicate(6, rnorm(10000, 1000, 125))
)

cluster <- kmeans(d, centers = 3)
cluster

d$grupo <- as.factor(cluster[['cluster']])

library(recipes)
library(caret)

r <- recipe(grupo ~ ., data = d)
p <- prep(r, d)
b <- bake(p, d)

t <- train(
  r, 
  d, 
  method = 'kmeans', 
  trControl = trainControl(method = 'cv', number = 3)
)

Error: Model kmeans is not in Caret’s built-in library

Is there any function equivalent to kmeans so that I can validate the model?

I fitted an LDA model and it worked, but I would like some guidance regarding kmeans.

1 answer

by Marcus Nunes • **17,915** points · Answer 1 · 2019-03-29T15:52:20+00:00

caret is short for Classification And REgression Training. By definition, it is a package that provides algorithms for classification and regression.

Classification is what we call a method capable of separating our observations according to predefined classes. This is called supervised learning. Several classification methods are available in caret, such as LDA, Random Forest, K-Nearest Neighbors and the like. The complete list of these methods is in the caret documentation.

Clustering is what we call a method capable of separating our observations without the need for predefined classes. This is called unsupervised learning.

K-means is a clustering method. Therefore, it is not available in caret, and probably never will be.

So the answer to the question

Is there a function equivalent to kmeans so I can validate the model?

is no: there is no function equivalent to kmeans in caret. It is a package for classification, not clustering.
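That said, a clustering can still be assessed without caret, using the quantities that kmeans itself returns. The following is only a minimal sketch, not a replacement for cross-validation: it assumes a smaller data set than the one in the question (1,000 rows, chosen for speed) and compares several values of k by the total within-cluster sum of squares and by the share of variance explained by the partition.

```r
# Sketch: compare k-means fits for several k using values returned by kmeans().
set.seed(15)

# Toy data in the style of the question, but smaller (an assumption for speed).
d <- data.frame(x = replicate(6, rnorm(1000, 1000, 125)))

ks <- 2:6
fits <- lapply(ks, function(k) kmeans(d, centers = k, nstart = 10))

# Total within-cluster sum of squares, the quantity k-means tries to minimize.
within_ss <- sapply(fits, function(f) f$tot.withinss)

# Share of the total variance explained by the partition: betweenss / totss.
explained <- sapply(fits, function(f) f$betweenss / f$totss)

print(data.frame(k = ks, within_ss = within_ss, explained = explained))
```

A clear "elbow" in within_ss as k grows suggests a natural number of groups. On data with no real group structure, such as the simulated data above, no such elbow should be expected.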

However, it is possible to use k-means as a classifier. As far as I know, there is no ready-made option for this in R, but nothing stops you from programming your own. I wouldn't recommend it, because k-means has serious problems, such as:

It does not work well on data with many dimensions
It does not work well if the groups have very different sizes
Because it uses Euclidean distance to decide which group each observation belongs to, it does not work well on data with strong asymmetry or many outliers
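If you still want to try k-means as a classifier, one possible approach is to fit kmeans on part of the data and assign each new observation to its nearest centroid. The sketch below assumes exactly that rule; predict_kmeans is a hypothetical helper written for this answer, not a function from base R or caret, and the 70/30 split is an arbitrary choice.

```r
# Hypothetical helper: assign each row of newdata to the nearest k-means centroid.
predict_kmeans <- function(fit, newdata) {
  x <- as.matrix(newdata)
  centers <- fit$centers
  # Squared Euclidean distance from every row (rows) to every centroid (columns).
  d2 <- matrix(
    vapply(
      seq_len(nrow(centers)),
      function(j) rowSums(sweep(x, 2, centers[j, ])^2),
      numeric(nrow(x))
    ),
    nrow = nrow(x)
  )
  # Index of the smallest distance in each row.
  max.col(-d2, ties.method = "first")
}

set.seed(15)
d <- data.frame(x = replicate(6, rnorm(1000, 1000, 125)))

# Manual hold-out: fit on 70% of the rows, assign the remaining 30%.
idx <- sample(nrow(d), size = 0.7 * nrow(d))
fit <- kmeans(d[idx, ], centers = 3, nstart = 10)
pred <- predict_kmeans(fit, d[-idx, ])
table(pred)
```

If true classes were available, pred could be cross-tabulated against them, after matching cluster numbers to class labels, since k-means numbers its clusters arbitrarily.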

On the other hand, it seems to me that your problem is really one of classification, because you have access to the class of each observation. Therefore, any classification method in caret would serve to train and validate your model. If I understand correctly and your problem is classification and not clustering, I suggest you give up on k-means and go for something more sophisticated.
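As an illustration, here is a minimal sketch of that route using LDA, which the question reports as working. It assumes the caret and MASS packages are installed, uses the formula interface instead of the recipe from the question, and uses a smaller data set for speed. The classes here are just the k-means labels from the question, so the accuracy mostly shows how well LDA can recover the k-means partition.

```r
library(caret)

set.seed(15)
d <- data.frame(x = replicate(6, rnorm(1000, 1000, 125)))

# Class labels built as in the question: the cluster assigned by kmeans.
d$grupo <- as.factor(kmeans(d, centers = 3)$cluster)

# 3-fold cross-validation of a supervised classifier (LDA) on those labels.
fit <- train(
  grupo ~ .,
  data = d,
  method = "lda",
  trControl = trainControl(method = "cv", number = 3)
)

# Cross-validated performance (Accuracy and Kappa columns).
fit$results
```

Swapping method = "lda" for another classifier from caret's list of models keeps the same validation setup.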