How to discover the relevant features of a dataset for a classification algorithm in Python?

Viewed 106 times

0

I have a dataset (an Excel spreadsheet) about the health of the elderly with about 112 columns, and I would like to know the best algorithm for selecting a subset of these columns while preserving the variability of the data and keeping the names of the selected columns (is this possible?).

In previous tests I used PCA, but the resulting components do not have meaningful names.

To put it in context, the main idea is to use an algorithm that selects columns from my dataset in order to eliminate the strong correlation between them, and then use an algorithm such as K-Means or DBSCAN (strictly speaking, clustering algorithms) to assign each person to a group (healthy, unhealthy, and so on).

I’m using the scikit-learn library at the moment.

  • Wouldn't it be easier to write a query against the database, selecting only the fields you need?

  • I’m actually working with an Excel spreadsheet! At first I selected only a few columns, but when I talked to a teacher who specializes in machine learning, I was told that picking columns arbitrarily is not a good approach and that the right way is to use an algorithm for this.

2 answers

0

Igor, try some more basic exploratory data analysis techniques first, such as loading the data into a pandas DataFrame and computing its correlations with data.corr(). This function calculates the pairwise correlation between the columns. You can then plot a heatmap to view the correlations more easily and drop the columns you consider redundant.

Maybe the post on this link can help you automate it.
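As a rough idea of what that automation could look like, here is a minimal sketch. The helper name drop_highly_correlated, the 0.9 threshold, and the toy columns are my own assumptions for illustration, not something from the linked post. It keeps the first column of each strongly correlated pair and drops the other, so the remaining columns keep their original names.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(data: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from every pair whose absolute correlation exceeds threshold.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    # Absolute pairwise correlations between the numeric columns only.
    corr = data.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is strongly correlated with an earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return data.drop(columns=to_drop)


# Toy data standing in for the spreadsheet (made-up column names).
rng = np.random.default_rng(0)
weight = rng.normal(size=200)
demo = pd.DataFrame({
    "weight": weight,
    "weight_copy": 2.0 * weight + rng.normal(scale=0.01, size=200),
    "age": rng.normal(size=200),
})

reduced = drop_highly_correlated(demo, threshold=0.9)
print(list(reduced.columns))
```

With a real spreadsheet you would load the data with pd.read_excel first. The threshold is a judgment call, so it is worth looking at the heatmap before settling on a value.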

  • Thank you, Matheus! It looks like what you suggested might work. I will test it.

0

Hello,

About PCA: I think you have misunderstood what it is meant to do. It is useful when you want to reduce the number of dimensions of a dataset, as it transforms combinations of attributes into a smaller number of new attributes with minimal loss of information. So you could apply PCA to your original dataset and build the classifier on the dataset that PCA generates. This has the advantage of keeping the loss of information to a minimum when reducing the number of columns, but the major disadvantage is that the resulting components are hard to interpret.
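To make the trade-off concrete, here is a small sketch with scikit-learn on synthetic data. The column names f0 to f4 and the 95% variance target are assumptions for illustration. It standardizes the columns, keeps enough components to explain 95% of the variance, and then prints the loadings, which show how much each original column contributes to each component. That is about as close as PCA gets to keeping a reference to the original names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 columns driven by 3 underlying factors (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
demo = pd.DataFrame({
    "f0": base[:, 0],
    "f1": base[:, 0] + 0.05 * rng.normal(size=300),  # near-duplicate of f0
    "f2": base[:, 1],
    "f3": base[:, 1] + 0.05 * rng.normal(size=300),  # near-duplicate of f2
    "f4": base[:, 2],
})

# PCA is sensitive to scale, so standardize the columns first.
scaled = StandardScaler().fit_transform(demo)

# A float n_components in (0, 1) keeps as many components as are needed
# to explain that fraction of the variance (requires svd_solver="full").
pca = PCA(n_components=0.95, svd_solver="full")
components = pca.fit_transform(scaled)

# Loadings: one row per component, one column per original feature.
loadings = pd.DataFrame(
    pca.components_,
    columns=demo.columns,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(pca.explained_variance_ratio_.round(3))
print(loadings.round(2))
```

Each component is a mix of several original columns, which is why the components themselves have no natural name.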

If you just want to identify the least influential fields so you can drop them, you can use methods like SelectKBest, which keeps the k highest-scoring features.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
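A minimal sketch of how this could look, assuming you already have a label for each person (SelectKBest is supervised, so it needs a target y). The synthetic data, the col_ names, the f_classif score function, and k=3 are all assumptions for illustration. Unlike PCA, the selected columns keep their original names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic labelled data standing in for the spreadsheet (illustrative only).
X_arr, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"col_{i}" for i in range(10)])

# Score every column against the labels with the ANOVA F-test and keep the best 3.
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)

# get_support() returns a boolean mask aligned with the input columns,
# so it can be used to recover the names of the selected columns.
selected = X.columns[selector.get_support()]
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

print(list(selected))
print(scores.round(2))
```

If you have no labels at all (for example, if you only plan to cluster with K-Means), this method does not apply directly, and an unsupervised filter such as a correlation threshold would be the closer fit.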

Here is a very interesting article on the subject: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
