My first thought is that k-means is not a classification method but a clustering method. The difference is subtle, but rather important.
k-means is an unsupervised method: nothing in the algorithm forces the created groups to resemble the groups of plant species (in this example).
Because it is an unsupervised method, it is also difficult to say which is the best possible clustering... It becomes a somewhat subjective problem. What can be used is:
- the sum of within-group variances: if it is too large within each group, your clustering is not very good
- there is also the Rand index, which is implemented in the fpc package that Robert mentioned in the comments (both diagnostics are sketched below)
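For illustration, here is a minimal sketch of both diagnostics on the iris data, assuming the fpc package is installed (exact numbers vary with the random initialization):

km <- kmeans(iris[, 1:4], 3)
km$tot.withinss  # total within-cluster sum of squares: lower means tighter groups
# corrected (adjusted) Rand index between the clusters and the true species
fpc::cluster.stats(dist(iris[, 1:4]), km$cluster,
                   alt.clustering = as.integer(iris$Species))$corrected.rand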
Finally, answering your questions:
It is clustering in a very wrong way: notice that the setosa individuals are split between two clusters (1 and 2), and that cluster 3 contains individuals of both versicolor and virginica. That is, the clustering is not helping to separate the plant classes.
I don't know, but at first glance you could label each cluster with the class that appears most often in it...
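For example, a hypothetical sketch of that majority-vote labeling on the iris data (the variable names here are just illustrative):

km <- kmeans(iris[, 1:4], 3)
tab <- table(km$cluster, iris$Species)
# label each cluster with its most frequent species
rotulo <- colnames(tab)[apply(tab, 1, which.max)]
table(rotulo[km$cluster], iris$Species)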
I can’t answer.
For me, in your case it would make more sense to use a supervised learning algorithm such as random forest, logistic regression, k-NN, etc.
To illustrate the problem of using k-means, consider the following dataset:
library(tibble)  # provides data_frame()

dados <- data_frame(
  x = runif(10000), y = runif(10000),
  # "azul" = blue, "vermelho" = red
  grupo = ifelse(x > 0.25 & x < 0.75 & y > 0.25 & y < 0.75, "azul", "vermelho")
)
Note that grupo is created deterministically from x and y: given the points, there is no randomness in the labels.

Now let's run k-means on this dataset and see whether the clusters match the groups.
cluster <- kmeans(dados[, 1:2], 2)  # uses only x and y; grupo is never shown to the algorithm
table(cluster$cluster, dados$grupo)
    azul vermelho
  1 1263     3670
  2 1273     3794
They didn't, because at no point did I ask k-means to separate the two groups. It only grouped observations whose values of x and y were close to each other...
See in the graph below how the groups turned out:
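(The original image is not reproduced here; a sketch of how such a graph could be drawn, assuming ggplot2 is available:)

library(ggplot2)
# colour each point by the cluster k-means assigned to it
ggplot(dados, aes(x, y, colour = factor(cluster$cluster))) +
  geom_point(alpha = 0.3)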

Now let's fit a random forest on this data:
dados$grupo <- as.factor(dados$grupo)
rf <- randomForest::randomForest(grupo ~ x + y, data = dados)
table(predict(rf, dados), dados$grupo)
             azul vermelho
    azul     2536        0
    vermelho    0     7464
Now yes! We correctly recovered everything that was blue and everything that was red. This happens because we are supervising the random forest, that is, we are providing labels for the algorithm to learn from.
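A side note: predicting on the training data, as above, tends to be optimistic. The fitted object also stores an out-of-bag confusion matrix, which gives a more honest error estimate:

rf$confusion  # out-of-bag confusion matrix with per-class error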
Look at the help of the clusterboot function in the fpc package. – Robert
On question 3, some supervised classifier will work even better than k-means for this problem, as @Danielfalbel already explained in his answer. I'm not very knowledgeable about R, but if you can use Python, check out scikit-learn: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html – Luiz Vieira
For SVM functions you have the kernlab package. – Robert
Yes, I'm aware of that. And I know the caret package too. But I am not looking for a classifier of this kind; I am interested in k-means and its limitations. – Marcus Nunes