As far as I could find, there is no built-in tool that gives the distance of each point to its cluster center or tells which ones are closer or farther away. My suggestion is to combine other functions to make this differentiation.
There are a few ways you can do this. I used numpy's norm, but maybe KMeans' transform would be a better option after all. It depends on how you are handling the data; I prefer numpy because I am more accustomed to working directly with the numbers.
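If you would rather stay inside scikit-learn, transform is that alternative: it returns, for each point, its distance to every cluster center, so you don't have to compute the norm yourself. A minimal sketch (the random data here is just illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d = np.concatenate((rng.random((50, 2)),
                    rng.random((30, 2)) * 0.15 + 0.2))

cl = KMeans(n_clusters=2, n_init=10).fit(d)
# transform() gives the distance from each point to *every* center,
# shape (n_points, n_clusters)
dists = cl.transform(d)
# pick, for each point, the distance to its own assigned center
own = dists[np.arange(len(d)), cl.labels_]
```

`own` is exactly the same quantity as the norm-based distance computed below, just obtained through the estimator's API.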
I worked in 2D to make it easier to demonstrate, but the method is valid for 3D too; just change the plots (and the initial data, of course). My approach was:
- Make a clustered dataset (I used numpy's random),
- Use k-means to find the clusters and their centers with fit,
- Separate the points by cluster and compute the distance (using norm) between each point and its respective center,
- Select the points whose distance is greater than the average of the distances,
- Plot the points so as to differentiate the nearest ones from the others.
Notes
In general you have to define where your "cut" is, i.e. what decides whether a point counts as near or far. I used any value greater than the average, but in general this is not perfect, so I added a factor (factor) multiplying the average. Just change the value of factor to get a different output, without having to touch the more central parts of the script.
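To see how factor behaves, you can sweep a few values and count how many points end up flagged; a larger factor always flags fewer (or equally many) points. A small illustrative sketch (the data and the values 0.5/1.0/1.5 are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
d = rng.random((100, 2))  # illustrative data
cl = KMeans(n_clusters=2, n_init=10).fit(d)
# distance of each point to its own cluster center
dist = np.linalg.norm(d - cl.cluster_centers_[cl.labels_], axis=1)
mean = dist.mean()
# number of points flagged as "far" for a few cutoff factors
counts = [int((dist > f * mean).sum()) for f in (0.5, 1.0, 1.5)]
print(counts)  # larger factor -> fewer flagged points
```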
I made several variables, basically one for each plot. This is obviously not necessary, but I thought it would make it simpler to show how the code works. If you follow this path to separate the points, in general just using logical (boolean) indexing (the way I separate the outer points) is enough.
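The logical indexing mentioned above works like this in numpy: a comparison on an array produces a boolean mask, and indexing with that mask keeps only the matching rows. A tiny self-contained example (the points are made up):

```python
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
center = pts.mean(axis=0)                    # centroid of the points
dist = np.linalg.norm(pts - center, axis=1)  # distance of each point to it
mask = dist > dist.mean()                    # boolean array, one entry per point
far = pts[mask]                              # keeps only rows where mask is True
```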
Below is the code I made.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
#adjustment factor for how far out your cutoff is
factor=1.0 #flags any distance greater than factor times the mean
a=np.random.rand(50,2)
b=np.random.rand(30,2)*0.15+0.2 #cluster 1
c=np.random.rand(30,2)*0.15+0.6 #cluster 2
d=np.concatenate((a,b,c))
fig, ax=plt.subplots()
ax.scatter(d[:,0],d[:,1])
plt.show()
#separate the clusters
cl=KMeans(n_clusters=2).fit(d)
#just to check
cl.cluster_centers_ #cluster centers
len(d[cl.labels_==0]) #number of points in cluster 0
len(d[cl.labels_==1]) #number of points in cluster 1
len(d) #total number of points
#separate the data by cluster
datac0=d[cl.labels_==0]
datac1=d[cl.labels_==1]
#second plot
fig2, ax2=plt.subplots()
ax2.scatter(datac0[:,0],datac0[:,1],c=[[0, 0, 1]])
ax2.scatter(datac1[:,0],datac1[:,1],c=[[1, 0, 0]])
plt.show()
#vector from each point to its respective center
distdata0=datac0-cl.cluster_centers_[0]
distdata1=datac1-cl.cluster_centers_[1]
#outer points
#here we keep the points whose distance is greater than the mean of the distances
#"norm" gives the absolute distance between each point and its respective center
#that value is then compared to the mean of the distances; if greater, the point is kept
meandist0=np.mean(np.linalg.norm(distdata0,axis=1))
meandist1=np.mean(np.linalg.norm(distdata1,axis=1))
outc0=datac0[np.linalg.norm(distdata0,axis=1)>factor*meandist0]
outc1=datac1[np.linalg.norm(distdata1,axis=1)>factor*meandist1]
#third plot: the outermost points are marked with an "x"
fig3, ax3=plt.subplots()
ax3.scatter(datac0[:,0],datac0[:,1],c=[[0, 0, 1]])
ax3.scatter(datac1[:,0],datac1[:,1],c=[[1, 0, 0]])
#plot the outermost "marked" points
ax3.scatter(outc0[:,0],outc0[:,1],c=[[0, 1, 1]],marker='x')
ax3.scatter(outc1[:,0],outc1[:,1],c=[[1, 1, 0]],marker='x')
plt.show()
This gives the result in these 3 charts.
The original images: Dots, Clusters, Clusters and points.
Hey, the ideas and methods you proposed are very interesting; I think they will really solve my problem. I will look into the necessary adaptations. Thank you; I have recently been adapting to working with "data science".
– Rafael