Finding the ideal number of clusters is not a trivial task. In general, unsupervised learning tasks are complicated to solve precisely because we don’t know the answer to the problem. Logical, when using the data set iris
we already know how many species of plants are present, but in the real world, a task of Clusterization does not give us this information.
Fortunately, there are methods that can be used to suggest a solution. One way to try to find the optimal number of clusters in a task like this is by using the packages NbClust
and factoextra
. I will illustrate three methods of this package here in this answer.
base <- iris
base2 <- base[3:4]
library(NbClust)
library(factoextra)
fviz_nbclust(base2, kmeans, method = "wss")
The first method is the wss
(or Within sum of Squares). It uses the sum of squares to find the ideal number of clusters. The suggested way to do this is somewhat subjective: look for the elbow in the above graph (i.e., the point at which it stabilizes) and this is the suggested number of clusters. In the case of this example, there are 3.
fviz_nbclust(base2, kmeans, method = "silhouette")
The silhouette method, which calculates the distances of each point from all other points in the sample, gives us another value: 2 clusters only. He doesn’t seem to separate the species well versicolor and virginica.
fviz_nbclust(base2, kmeans, method = "gap_stat")
Finally, the gap method computes the dispersions within each cluster and ends up, in this case, agreeing with the wss method, suggesting again 3 clusters.
Therefore, there is no definitive way to state the optimal number of clusters in an analysis like this. What I do, when addressing a problem like this, is to apply the three methods and choose the value that most repeats as the optimal number of clusters. In the case of this example, this value is 3.
If all three values differ, then I try to talk to an expert in the field, see what he thinks is most befitting of reality. If there isn’t an expert in the field available, I try to justify my decision based on subjective criteria. For example, I would list the elements belonging to each group and try to justify why 3 groups are better than 2 or 4 if I have three different results.
Uganda the functions
unique(iris$Species)
orlevels(iris$Species)
?– Willian Vieira
Sorry @Willian Vieira, I don’t understand your question.
– Izak Mandrak