K-Means Algorithm

0

Good morning. I am going to apply K-Means to a given dataset. I have already run it without problems on a fully numerical dataset, e.g. Iris (versicolor, setosa, virginica), but I came across a dataset with categorical data (names and numbers) and I need to run K-Means on it as well. Does anyone know how to set up the clustering function for categorical attributes?

Note: the subject is data privacy, so some of the data will be deleted, but for this question I only need to know how to build clusters with categorical data.

  • I didn't quite understand the question. Could you add examples of what you have as input/output and what is expected? Are the non-numerical entries categorical (such as blue or red) or text (such as names, descriptions)?

2 answers

1


Most ML algorithms do not accept categorical data as input, so the data must be transformed before it can be used. Some of the most common techniques are:

  • Label encoding - assigns a number to each category, so that, for example, the values A, B and C become 1, 2 and 3. Use this technique carefully, because algorithms can interpret the numbers as an ordinal relationship between the categories.

  • One-hot encoding - converts each categorical value into its own column in the dataset, holding 1 when that value was present in the source column and 0 when it was absent. Check the cardinality of the source column first, because each distinct value adds a column and can greatly increase the dimensionality of the dataset.

  • Word embedding - represents each word as a numeric vector that captures its semantic meaning; widely used in NLP.

It is important to check which transformation best suits your categorical data and apply it. After that, the data can be used as input to clustering or classification models.
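To make the first two techniques concrete, here is a minimal sketch using pandas. The `color` column and its values are hypothetical and only serve as illustration; they do not come from the dataset in the question.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
df = pd.DataFrame({'color': ['blue', 'red', 'green', 'red']})

# Check the cardinality before one-hot encoding: one new column per value.
n_categories = df['color'].nunique()

# Label encoding: each category becomes an integer code.
# pandas sorts the categories, so blue=0, green=1, red=2.
df['color_label'] = df['color'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df['color'], prefix='color', dtype=int)

print(df)
print(one_hot)
```

Note that the label codes imply blue < green < red, which is exactly the artificial ordering mentioned above. The one-hot version avoids it at the cost of three columns instead of one.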

0

TL;DR

Given how simple the question is: convert the data to numeric, or create a new numeric variable based on the categorical one you have. I do not know what you are using to process the dataset, so below I leave a very simple example using pandas.DataFrame, which is very common in the Python world:

import pandas as pd

# aluno (student) and periodo (term) are numeric; nota (grade) is categorical
dataset = {'aluno': [1, 1, 1, 2, 2],
           'periodo': [1, 2, 1, 1, 1],
           'nota': ['A', 'D', 'C', 'D', 'A']}

df = pd.DataFrame(dataset, columns=['aluno', 'periodo', 'nota'])
print('', 'DataFrame Original:', df, sep='\n')

In this example the original dataset has two numerical columns, aluno and periodo, and one categorical column, nota. The output of the code above is:

DataFrame Original:
   aluno  periodo nota
0      1        1    A
1      1        2    D
2      1        1    C
3      2        1    D
4      2        1    A

Now, in the code below, we create a new numerical column (nota_numerica) based on the nota column:

# Map each grade to a numeric score
notas = {'A': 100, 'B': 80, 'C': 60, 'D': 40}
df['nota_numerica'] = df['nota'].map(notas)
print('', 'DataFrame Modificado:', df, sep='\n')

The output of this new code would be:

DataFrame Modificado:
   aluno  periodo nota  nota_numerica
0      1        1    A            100
1      1        2    D             40
2      1        1    C             60
3      2        1    D             40
4      2        1    A            100

As I said before, this is a very simple example, just like the question. Depending on the complexity of the real problem, you can of course build a more elaborate solution with pandas itself.

See it working on repl.it.
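Once every column is numeric, the DataFrame can be handed to a K-Means implementation. The sketch below assumes scikit-learn's `KMeans`, which the question does not mention, so treat the library choice, `n_clusters=2`, and the decision to leave out aluno as assumptions made only for illustration. It one-hot encodes nota rather than using nota_numerica, but either representation could be passed in the same way.

```python
import pandas as pd
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

df = pd.DataFrame({'aluno': [1, 1, 1, 2, 2],
                   'periodo': [1, 2, 1, 1, 1],
                   'nota': ['A', 'D', 'C', 'D', 'A']})

# One-hot encode nota so K-Means does not assume an order between grades.
# aluno is left out because it looks like an identifier (an assumption).
features = pd.get_dummies(df[['periodo', 'nota']], columns=['nota'], dtype=float)

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df['cluster'] = kmeans.fit_predict(features)

print(df)
```

With real data you would probably also want to put the numeric columns on a scale comparable to the 0/1 columns, since K-Means relies on Euclidean distance.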
