My first thought is that kmeans
is not a classification method but a clustering method. The difference is subtle, but rather important.
kmeans
is an unsupervised method: nothing in the algorithm forces the created groups to resemble the groups of plant species (in this example).
Because it is an unsupervised method, it is also difficult to say which clustering is the best possible one... It becomes a somewhat subjective problem. What can be used is:
- the sum of within-group variances: if it is too large within each group, it means your clustering is not very good;
- there is also the Rand index, which is implemented in the fpc
package
that Robert mentioned in the comments.
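As a rough sketch of both criteria (assuming the fpc package is installed; its cluster.stats function reports the corrected Rand index when the true labels are supplied as the alternative clustering):

```r
library(fpc)

# Cluster the iris measurements into 3 groups
set.seed(42)
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Total within-cluster sum of squares: smaller means tighter clusters
cl$tot.withinss

# Corrected (adjusted) Rand index against the true species labels:
# 1 = perfect agreement, values near 0 = agreement expected by chance
stats <- cluster.stats(dist(iris[, 1:4]), cl$cluster,
                       alt.clustering = as.integer(iris$Species))
stats$corrected.rand
```

Neither number is a definitive verdict on its own; they are ways to compare candidate clusterings against each other or against known labels.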
Finally, answering your questions:
It is classifying in a very poor way: just notice that individuals of class setosa
are divided between two clusters, 1 and 2, and cluster 3 contains individuals of both versicolor
and virginica
. That is, the clustering is not helping to separate the plant species.
I don’t know of a standard rule, but as a first approach you could say that the label of each cluster is that of the class that appears most often in that cluster...
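A minimal sketch of that majority-vote labeling on iris (the variable names here are mine, not from the original answer):

```r
set.seed(1)
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Cross-tabulate clusters against the true species
tab <- table(cluster = cl$cluster, species = iris$Species)

# Label each cluster with its most frequent species
rotulos <- colnames(tab)[apply(tab, 1, which.max)]
rotulos

# How often the majority label matches the true species
mean(rotulos[cl$cluster] == as.character(iris$Species))
```

Note that two clusters can end up with the same majority label, which is itself a sign the clustering does not match the classes.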
I can’t answer.
For me, in your case it would make more sense to use a supervised learning algorithm such as random forest
, logistic regression
, knn
, etc.
To illustrate the problem of using kmeans
, consider the following dataset:
library(tibble)

dados <- tibble(
  x = runif(10000), y = runif(10000),
  grupo = ifelse(x > 0.25 & x < 0.75 & y > 0.25 & y < 0.75, "azul", "vermelho")
)
Note that the group is created deterministically from x
and y
. There is no randomness in the labels.
Now run kmeans
on this data and let’s see whether the clusters resemble the groups.
cluster <- kmeans(dados[,1:2], 2)
table(cluster$cluster, dados$grupo)
azul vermelho
1 1263 3670
2 1273 3794
They didn’t, because at no point did I ask kmeans
to separate the two groups. It separated the points only according to which values of x
and y
were close to each other...
See in the graph how the groups looked:
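The original image is not reproduced here, but a graph like it could be sketched as follows (assuming ggplot2 is installed, and reusing dados and cluster from the code above):

```r
library(ggplot2)

# Color each point by the cluster kmeans assigned to it.
# The true "azul" square in the middle gets cut across both clusters.
dados$cluster <- factor(cluster$cluster)
p <- ggplot(dados, aes(x, y, colour = cluster)) +
  geom_point(size = 0.5)
p
```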
Now let’s fit a random forest on this data:
dados$grupo <- as.factor(dados$grupo)
rf <- randomForest::randomForest(grupo ~ x + y, data = dados)
table(predict(rf, dados), dados$grupo)
azul vermelho
azul 2536 0
vermelho 0 7464
Now yes! We were able to recover everything that was blue and everything that was red. This happens because we are supervising the random forest
, that is, we are providing labels for the algorithm to learn from.
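One caveat not in the original answer: the table above evaluates the forest on the same data it was trained on, which flatters any classifier. A sketch with a holdout set (assuming dados exists with grupo already converted to a factor, as above):

```r
set.seed(123)

# Hold out 20% of the rows for evaluation
treino <- sample(nrow(dados), 0.8 * nrow(dados))

rf <- randomForest::randomForest(grupo ~ x + y,
                                 data = dados[treino, ])

# Confusion matrix on rows the model never saw
table(predict(rf, dados[-treino, ]), dados$grupo[-treino])
```

On this particular dataset the boundary is deterministic, so the holdout accuracy should still be very high; on noisy real data the difference matters more.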
Take a look at the help of the function
clusterboot
in the package fpc
. – Robert
On question 3, some supervised classifier will do even better than k-means for this problem, as @Danielfalbel already explained in his answer. I’m not very knowledgeable about R, but if you can use Python, check out scikit-learn: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
– Luiz Vieira
For SVM functions there is the package
kernlab
– Robert
Yes, I’m aware of that. And I know the package
caret
as well. But I am not looking for a classifier of this type; I am interested in kmeans and its limitations. – Marcus Nunes