Is it possible to use K-Means (or another clustering method) with point limits?

Question

Asked 5 years, 5 months ago

I am developing clustering code with k-Means and have the following question: is it possible to set a limit on the number of points per cluster, with k-Means or another algorithm?

To explain the case better: in the code below I have two predetermined centroids and 12 points. After running k-Means, 8 points are assigned to centroid 0 and 4 points to centroid 1.

from sklearn.cluster import KMeans
import numpy as np

# Centroids: with two distinct points and n_clusters=2, the fitted centers are the reference points themselves
refs = [[-22.87042313, -43.33995681], [-22.91265768, -43.23596109]]
kmeans_model = KMeans(n_clusters=len(refs), random_state=0).fit(refs)
ref_labels = kmeans_model.labels_
centroids = kmeans_model.cluster_centers_

#Points:
points = [[-22.8595871, -43.2385504], [-23.0144844, -43.4727984], [-22.8727929, -43.4082954],
          [-22.9478637, -43.3652225], [-22.8213579, -43.1740529], [-22.9592171, -43.3508173],
          [-22.8236928, -43.3203929], [-22.9027656, -43.3541462], [-22.8749724, -43.5034297],
          [-22.8456399, -43.2840653], [-22.8893855, -43.2424886], [-22.8499984, -43.2564374]]

# Clustering: assign each point to its nearest centroid
kmeans_model.predict(points)
# Output: array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1], dtype=int32)

Can I set how many points go into each cluster and have the remaining points treated as a sort of "surplus"?

For example:

centroid 0 = 4 points

centroid 1 = 3 points

run the k-Means...

Output: [1,0,0,0,1,0,NA,NA,NA,NA,1,NA]. The NA values would be the "surplus": points that are not close enough to win a "vacancy" in the cluster.
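To make the desired behavior concrete, here is a minimal sketch of one possible rule, which is an assumption and not something k-Means does by itself: each point goes to its nearest centroid, each centroid keeps only its closest points up to its limit, and everything else is marked -1 (standing in for NA). The function name `assign_with_limits` and the `limits` argument are hypothetical names used only for illustration.

```python
import numpy as np

def assign_with_limits(points, centroids, limits):
    """Label each point with its nearest centroid, keeping at most
    limits[k] points (the closest ones) per centroid; the rest get -1."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # dists[i, k] = Euclidean distance from point i to centroid k
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    labels = np.full(len(points), -1, dtype=int)  # -1 plays the role of NA
    for k, limit in enumerate(limits):
        members = np.flatnonzero(nearest == k)
        # Sort this centroid's candidates by distance and keep the closest `limit`
        closest = members[np.argsort(dists[members, k])[:limit]]
        labels[closest] = k
    return labels

refs = [[-22.87042313, -43.33995681], [-22.91265768, -43.23596109]]
points = [[-22.8595871, -43.2385504], [-23.0144844, -43.4727984], [-22.8727929, -43.4082954],
          [-22.9478637, -43.3652225], [-22.8213579, -43.1740529], [-22.9592171, -43.3508173],
          [-22.8236928, -43.3203929], [-22.9027656, -43.3541462], [-22.8749724, -43.5034297],
          [-22.8456399, -43.2840653], [-22.8893855, -43.2424886], [-22.8499984, -43.2564374]]

labels = assign_with_limits(points, refs, limits=[4, 3])
print(labels)
```

Under this rule a point never moves to a farther centroid that still has room; whether that is the right behavior depends on the use case.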

1 answer

by AlexCiuffa • **2,402** points · Answer 1 · 2020-03-13T18:20:22+00:00

Maybe you should change your approach. If the goal is to have "values that are not close enough to achieve a 'vacancy' in the cluster", a density-based clustering approach seems more appropriate.

I suggest trying DBSCAN (or even OPTICS). Both are implemented in sklearn, so you can just import and use the algorithm:

from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=3, min_samples=2).fit(X)  # X: array of shape (n_samples, n_features)

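Try tuning the epsilon parameter (eps), which is the maximum distance between two points for one to be considered in the neighborhood of the other. It is expressed in the same units as the data, so for coordinates in degrees like yours, eps=3 would put every point in a single cluster and a much smaller value is needed. Points that do not fit any cluster are labeled -1 (noise), which corresponds to your "surplus".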
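As a rough sketch of how this could look on the points from the question: the value eps=0.05 below is only an illustrative guess, not a tuned setting, and plain Euclidean distance on degree coordinates is assumed for simplicity.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [-22.8595871, -43.2385504], [-23.0144844, -43.4727984], [-22.8727929, -43.4082954],
    [-22.9478637, -43.3652225], [-22.8213579, -43.1740529], [-22.9592171, -43.3508173],
    [-22.8236928, -43.3203929], [-22.9027656, -43.3541462], [-22.8749724, -43.5034297],
    [-22.8456399, -43.2840653], [-22.8893855, -43.2424886], [-22.8499984, -43.2564374],
])

# eps is in the units of the data (degrees here); 0.05 is an assumed starting value
clustering = DBSCAN(eps=0.05, min_samples=2).fit(points)
labels = clustering.labels_  # one label per point; -1 marks noise ("surplus")

# Count the clusters found (ignoring the noise label) and the noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```

Note that DBSCAN finds the number of clusters on its own and does not use your predetermined centroids or enforce a fixed number of points per cluster, so it only helps if the "surplus" is defined by distance rather than by a hard limit.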