Clustering in R

Question

Asked 7 years, 9 months ago

Guys, I need to cluster this database and then make predictions. How should I recode the values correctly in this case?

Which type of clustering would fit best?

I’m a beginner in the data field and I’m trying to solve this problem because I believe it will be a great challenge for my learning.

To reinforce: I would like to turn the data into numbers so that I can run it through kmeans, for example. But I’m open to suggestions.

Start by looking at the hclust function. And please don’t post data that way, never as an image file; post the output of dput(dados) instead.

– Rui Barradas

2017/10/19 at 15:25
Oops, thanks Rui. This was my first post...

– Leonardo Ferreira

2017/10/20 at 20:14

1 answer

by Homunculus • **101** points · Answer 1 · 2017-10-23T00:23:46+00:00

What you described in the question is not enough on its own to give you a precise answer. I suggest that next time, or if this answer is not satisfactory, you explain a little about what the database describes.

The first step is pre-processing. What it involves depends on the type of algorithm you want to use. But for an algorithm like K-Means, identifying outliers and doing some sort of imputation is almost essential. After all, it’s based on the mean, which outliers distort and which cannot be computed over missing values.
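As a rough illustration of that pre-processing, here is a minimal sketch in base R. The data frame `dados` below is invented (I only borrowed the names `price` and `order_status`), and the median imputation, the 1.5 * IQR outlier rule and the one-hot encoding are just common choices I am assuming, not the only way to do it.

```r
# Hypothetical data: values are invented for illustration only.
dados <- data.frame(
  price        = c(10, 12, NA, 11, 13, 250, 9, 12),
  order_status = factor(c("delivered", "delivered", "canceled", "delivered",
                          "shipped", "delivered", "canceled", "shipped"))
)

# 1. Impute missing numeric values with the median (less sensitive to outliers
#    than the mean).
dados$price[is.na(dados$price)] <- median(dados$price, na.rm = TRUE)

# 2. Flag outliers with the 1.5 * IQR rule (an assumed, commonly used rule).
q   <- unname(quantile(dados$price, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier  <- dados$price < q[1] - 1.5 * iqr | dados$price > q[2] + 1.5 * iqr
dados_clean <- dados[!is_outlier, ]

# 3. Turn the factor into numeric indicator (one-hot) columns so that
#    distance-based methods such as kmeans can read it.
X <- model.matrix(~ 0 + order_status + price, data = dados_clean)

# 4. Put all columns on a comparable scale before clustering.
X_scaled <- scale(X)
```

Whether to drop the flagged rows, cap them, or keep them depends on what the values mean in your data, so treat step 2 as a starting point.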

The k-Means algorithm is one of the top 10 algorithms used in data mining (link), and it has been around for a long time. Given that, I think it would be a great start to work with this algorithm, but applied to grouping time series. Time series are data whose values are recorded as a function of time. In your case, one idea is to group weeks that behave similarly. You can do this with different kinds of attributes, for example by checking which weeks behave similarly in terms of order_status, price, etc.
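To make the "group similar weeks" idea concrete, here is a small sketch using only base R. The `orders` table is simulated, and the weekly features (`mean_price`, `n_orders`) and the choice of 3 clusters are assumptions of mine; with the real data you would pick features and the number of clusters that make sense for it.

```r
set.seed(42)

# Hypothetical orders table: dates and prices are simulated.
n <- 200
orders <- data.frame(
  date  = as.Date("2017-01-02") + sample(0:83, n, replace = TRUE),
  price = round(runif(n, 5, 100), 2)
)

# Assign each order to its calendar week (cut.Date starts weeks on Monday).
orders$week <- cut(orders$date, breaks = "week")

# One row per week, described by a couple of summary features.
weekly <- data.frame(
  mean_price = as.numeric(tapply(orders$price, orders$week, mean)),
  n_orders   = as.numeric(tapply(orders$price, orders$week, length))
)
weekly <- na.omit(weekly)  # drop weeks without any order, if there are any

# Scale the features and run k-means; nstart repeats the random initialisation.
km <- kmeans(scale(weekly), centers = 3, nstart = 25)
weekly$cluster <- km$cluster
```

Weeks that end up with the same value in `weekly$cluster` have similar average price and order volume under these assumed features.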

Another good algorithm for those starting out is DBSCAN, which is a density-based algorithm (k-Means is prototype-based). It’s very simple, and you don’t even have to worry about outliers, since they will most likely be labeled as noise and left out of every cluster. But I leave the job of deciding where to apply it to you, since you have a better sense of where this database came from.
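A minimal DBSCAN sketch, assuming the CRAN package `dbscan` is installed (it is not part of base R). The points are simulated, and the values of `eps` and `minPts` are guesses that happen to suit this toy data; on real data they need tuning.

```r
# Assumes: install.packages("dbscan") has been run beforehand.
library(dbscan)

set.seed(1)

# Simulated data: two dense groups plus one isolated point.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  c(10, 10)  # far from everything else
)

# eps is the neighbourhood radius, minPts the minimum number of points
# needed to form a dense region.
db <- dbscan(x, eps = 0.5, minPts = 5)

# Cluster label 0 means "noise": the point was not assigned to any cluster.
table(db$cluster)
```

The isolated point gets label 0 because it has no neighbours within `eps`, which is how DBSCAN sets outliers aside instead of forcing them into a group.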

Good luck.