How to customize matplotlib plots in a more clustering-oriented way?

Viewed 295 times

0

I am doing data analysis and need to show the points that lie farther from their clusters in a different way, such as a different symbol or a stronger color. I looked through the documentation but did not find anything like that. The logic I had in mind is that a point at some distance N from its cluster gets this change. I am using scikit-learn's clustering.

enter image description here

The circled points are an example of the data I want to mark this way.

1 answer

1

As far as I could find, there is no ready-made tool that differentiates the points or tells you which ones are closer or farther away. My suggestion is to use other functions to make this distinction.

There are a few ways to do this. I used NumPy's norm (np.linalg.norm), but the transform method of KMeans may be a better option after all. It depends on how you are handling the data; I prefer going through NumPy because I am more used to working directly with the numbers.
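As a rough illustration of the transform alternative mentioned above, here is a minimal sketch. The helper name own_center_distances and the seeded sample data are assumptions made for this example; they are not part of the answer's script.

```python
import numpy as np
from sklearn.cluster import KMeans


def own_center_distances(model, points):
    """Distance from each point to the center of the cluster it is assigned to."""
    # transform() returns an (n_samples, n_clusters) array of distances
    # from every point to every cluster center.
    all_dist = model.transform(points)
    # predict() gives each point's cluster index; use it to pick one column per row.
    labels = model.predict(points)
    return labels, all_dist[np.arange(len(points)), labels]


# Hypothetical sample data, built like the data in the answer's script.
rng = np.random.default_rng(0)
pts = np.concatenate((
    rng.random((50, 2)),
    rng.random((30, 2)) * 0.15 + 0.2,
    rng.random((30, 2)) * 0.15 + 0.6,
))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
labels, dist = own_center_distances(km, pts)
```

The resulting dist array should match what the norm-based approach computes per cluster, so the same "greater than the average" cutoff can be applied to it.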

I worked in 2D because it is easier to demonstrate, but the method is valid in 3D too; just change the plots (and the initial data, of course). My approach was:

  1. Generate clustered sample data (I used NumPy),
  2. Use k-means to find the clusters and their centers with fit,
  3. Separate the points by cluster and compute the distance (using the norm) between each point and its center,
  4. Select the points whose distance is greater than the average distance,
  5. Plot the points so that the nearest ones are distinguished from the others.

Notes

  • In general you have to define where your "cut" is, that is, what separates a near point from a far one. I used any value greater than the average, but this is not perfect in general. I already included a factor (factor) that multiplies the average, so you only need to change the value of factor to get a different output, without touching the core of the script.

  • I created several variables, basically one per plot. This is obviously not necessary, but I thought it would make it simpler to show how the code works. If you follow this approach to separate the points, using only logical (boolean) indices, as I do for the outer points, is generally enough.
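To make the two notes concrete, here is a minimal sketch that uses a single boolean mask instead of one variable per plot and shows where factor enters the cutoff. The function names far_mask and plot_near_far are hypothetical, and labels and centers are assumed to come from an already fitted clustering (for example labels_ and cluster_centers_ of KMeans).

```python
import numpy as np
import matplotlib.pyplot as plt


def far_mask(points, labels, centers, factor=1.0):
    """True where a point is farther from its center than factor * mean distance of its cluster."""
    # centers[labels] lines up each point with the center of its own cluster
    dist = np.linalg.norm(points - centers[labels], axis=1)
    mask = np.zeros(len(points), dtype=bool)
    for k in range(len(centers)):
        in_k = labels == k
        if in_k.any():
            mask[in_k] = dist[in_k] > factor * dist[in_k].mean()
    return mask


def plot_near_far(points, labels, centers, factor=1.0):
    """Scatter each cluster in its own color, with far points drawn as 'x'."""
    mask = far_mask(points, labels, centers, factor)
    fig, ax = plt.subplots()
    for k in range(len(centers)):
        in_k = labels == k
        ax.scatter(*points[in_k & ~mask].T, color=f"C{k}", marker="o")
        ax.scatter(*points[in_k & mask].T, color=f"C{k}", marker="x")
    return fig, ax


# Tiny hand-made example: one cluster centered at the origin.
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
toy_labels = np.array([0, 0, 0, 0])
toy_centers = np.array([[0.0, 0.0]])
toy_far = far_mask(toy_points, toy_labels, toy_centers, factor=1.0)
```

In the toy data the distances are 0, 1, 1 and 4, so the mean is 1.5 and only the last point exceeds it with factor=1.0; raising factor to 3.0 moves the cutoff to 4.5 and nothing is flagged. Call plt.show() after plot_near_far to display the figure.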

Here is the code I wrote.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# adjustment factor: how far out the cutoff sits
factor=1.0  # 1.0 flags any point farther than the mean distance

a=np.random.rand(50,2)
b=np.random.rand(30,2)*0.15+0.2  #cluster 1
c=np.random.rand(30,2)*0.15+0.6  #cluster 2
d=np.concatenate((a,b,c))  

fig, ax=plt.subplots()
ax.scatter(d[:,0],d[:,1])
plt.show()

# find the clusters
cl=KMeans(n_clusters=2).fit(d)

# just to check: centers and number of points per cluster
print(cl.cluster_centers_)  # centers
print(len(d[cl.labels_==0]), len(d[cl.labels_==1]), len(d))

# split the data by cluster
datac0=d[cl.labels_==0]
datac1=d[cl.labels_==1]

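# second plot: one color per cluster
# (color= sets a single RGB color; c= with a 3-element list is ambiguous)
fig2, ax2=plt.subplots()
ax2.scatter(datac0[:,0],datac0[:,1],color=[0, 0, 1])
ax2.scatter(datac1[:,0],datac1[:,1],color=[1, 0, 0])
plt.show()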


# vectors from each cluster center to the points of that cluster
distdata0=datac0-cl.cluster_centers_[0]
distdata1=datac1-cl.cluster_centers_[1]

# outer points
# select the points whose distance is greater than the mean distance
# the norm gives the absolute distance between each point and its center
# each distance is then compared to factor times the mean; larger ones are kept
meandist0=np.mean(np.linalg.norm(distdata0,axis=1))
meandist1=np.mean(np.linalg.norm(distdata1,axis=1))
outc0=datac0[np.linalg.norm(distdata0,axis=1) >factor*meandist0]
outc1=datac1[np.linalg.norm(distdata1,axis=1) >factor*meandist1]

# third plot: clusters, with the outer points marked with "x"
fig3, ax3=plt.subplots()
ax3.scatter(datac0[:,0],datac0[:,1],color=[0, 0, 1])
ax3.scatter(datac1[:,0],datac1[:,1],color=[1, 0, 0])
# draw the outer points on top, "marked"
ax3.scatter(outc0[:,0],outc0[:,1],color=[0, 1, 1],marker='x')
ax3.scatter(outc1[:,0],outc1[:,1],color=[1, 1, 0],marker='x')
plt.show()

This produces the following three charts:

Original images: Dots, Clusters, Clusters and points.
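For the 3D case mentioned earlier in this answer, a minimal sketch might look like the following. It assumes a matplotlib version that accepts projection="3d" in add_subplot without an extra import; the function name plot_far_points_3d and the seeded sample data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def plot_far_points_3d(points, n_clusters=2, factor=1.0, random_state=0):
    """Cluster 3D points and draw the far ones as 'x'; returns (fig, ax, far mask)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(points)
    # distance from each point to the center of its own cluster
    dist = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)
    far = np.zeros(len(points), dtype=bool)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for k in range(n_clusters):
        in_k = km.labels_ == k
        # same cutoff as in 2D: factor times the mean distance within the cluster
        far_k = in_k & (dist > factor * dist[in_k].mean())
        near_k = in_k & ~far_k
        ax.scatter(points[near_k, 0], points[near_k, 1], points[near_k, 2],
                   color=f"C{k}", marker="o")
        ax.scatter(points[far_k, 0], points[far_k, 1], points[far_k, 2],
                   color=f"C{k}", marker="x")
        far |= far_k
    return fig, ax, far


# Hypothetical 3D data built the same way as the 2D data in the script.
rng = np.random.default_rng(0)
pts3d = np.concatenate((
    rng.random((50, 3)),
    rng.random((30, 3)) * 0.15 + 0.2,
    rng.random((30, 3)) * 0.15 + 0.6,
))
fig3d, ax3d, far3d = plot_far_points_3d(pts3d)
```

Only the data shape and the axes change compared with the 2D script; the distance and cutoff logic stay the same. Call plt.show() to display the figure.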

  • 1

    Wow, man, the ideas and methods you proposed are very interesting. I think they will really solve my problem; I will work out the necessary adaptations. Thank you, I have only recently started working with "data science".
