Cluster analysis by groups


I’m trying to run a cluster analysis separately for each of several groups within a data frame, with the aim of returning the results of this analysis (e.g. the resulting cluster assignments) as a data frame through the function tidy from the broom package.

dput

dataset=structure(list(a = c(28L, 19L, 92L, 35L, 42L, 82L, 91L, 98L, 
58L, 58L, 92L, 61L, 67L, 73L, 4L, 35L, 9L, 17L, 7L, 82L, 24L,   
51L, 45L, 1L, 97L, 97L, 99L, 5L, 67L, 97L, 95L, 77L, 56L, 67L, 
80L, 22L, 87L, 31L, 97L, 15L, 12L, 94L, 18L, 86L, 1L, 99L, 2L, 
88L, 84L, 65L, 59L, 38L, 8L, 46L, 66L, 30L, 32L, 36L, 17L, 35L, 
40L, 16L, 60L, 28L, 47L, 56L, 82L, 88L, 76L, 38L, 88L, 61L, 26L, 
64L, 24L, 48L, 30L, 68L, 88L, 42L, 62L, 12L, 76L, 37L, 25L, 91L, 
18L, 76L, 13L, 24L, 49L, 89L, 35L, 88L, 19L, 24L, 62L, 91L, 99L,  
18L), b = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("group1", 
"group2", "group3", "group4"), class = "factor"), c = c(61L, 
28L, 82L, 38L, 22L, 79L, 7L, 12L, 73L, 78L, 17L, 28L, 30L, 11L, 
99L, 47L, 42L, 51L, 13L, 16L, 35L, 51L, 92L, 41L, 45L, 27L, 17L, 
37L, 27L, 53L, 23L, 50L, 81L, 25L, 93L, 11L, 80L, 35L, 32L, 9L, 
56L, 18L, 17L, 63L, 49L, 11L, 26L, 93L, 45L, 7L, 43L, 90L, 31L, 
80L, 53L, 66L, 62L, 13L, 54L, 7L, 20L, 37L, 79L, 52L, 35L, 8L, 
6L, 46L, 35L, 3L, 18L, 82L, 92L, 80L, 8L, 87L, 89L, 20L, 26L, 
86L, 29L, 55L, 46L, 83L, 66L, 25L, 17L, 68L, 21L, 83L, 26L, 97L, 
54L, 71L, 19L, 6L, 20L, 86L, 83L, 8L)), class = "data.frame", row.names = c(NA, 
-100L))

I tried that:

library(dplyr)
library(broom)

res1<-dataset%>%
group_by(b)%>%
do(cluster= 
       kmeans(dataset[,c(1,3)],centers=3))

res2<-tidy(res1,cluster)

But I don’t get what I want: the resulting data frame should have 100 rows, each with its respective cluster derived from the analysis. Is there an error in my code, or is this function not suitable for this task?

1 answer

This function is not suitable for this task, at least not the way it is being used here. The trick is to use the function nest from the tidyr package:

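library(dplyr)
library(tidyr)
library(purrr)  # map()
library(broom)  # tidy()

cluster <- dataset %>%
  # tidyr >= 1.0.0 syntax; older versions wrote this as nest(a, c)
  nest(data = c(a, c)) %>%
  mutate(model = map(data, kmeans, 3),   # one k-means fit (3 centers) per group
         centers = map(model, tidy))     # tidy summary of each fit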

With it, I can tell R which columns to collect into a list column (in this case a list with one element per level of dataset$b, i.e. nlevels(dataset$b) elements) for the clustering. Then I use mutate to actually fit the clustering on each of those pieces of data.

Note that the result is consistent with what was expected:

cluster %>%
  unnest(centers)

        b       x1       x2 size withinss cluster
1  group1 82.62500 20.75000    8 2529.375       1
2  group1 56.50000 83.83333    6 5278.333       2
3  group1 24.36364 39.00000   11 4378.545       3
4  group2 86.40000 23.20000   10 2952.000       1
5  group2 13.25000 30.00000    8 2861.500       2
6  group2 81.57143 73.28571    7 2947.143       3
7  group3 40.20000 73.70000   10 4321.700       1
8  group3 76.50000 33.50000    6 2337.000       2
9  group3 33.33333 18.00000    9 3200.000       3
10 group4 87.25000 62.75000    8 5699.000       1
11 group4 31.62500 75.37500    8 2415.750       2
12 group4 37.00000 18.44444    9 4592.222       3

The problem is that we still don’t have what really interests you, which is the cluster to which each observation belongs. But we do have the centers of each cluster, so we can predict the cluster of each observation by assigning it to the nearest center in Euclidean distance. For this, we will use the function cl_predict from the clue package:

library(clue)  # cl_predict()

dataset %>%
  filter(b == "group1") %>%
  select(-b) %>%
  cl_predict(cluster$model[[1]], .)

Class ids:
 [1] 3 3 2 3 3 2 1 1 2 2 1 1 1 1 2 3 3 3 3 1 3 3 2 3 1
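
To make the nearest-center rule concrete, here is a minimal base R sketch of the idea. The points and centers below are made-up illustrative values, not taken from the dataset, and this is not a claim about how cl_predict is implemented internally.

```r
# Hypothetical toy points (rows) with the same two variables, a and c
x <- cbind(a = c(1, 2, 90, 95, 50),
           c = c(1, 3, 90, 88, 50))

# Hypothetical cluster centers, one per row
centers <- rbind(c(1.5, 2), c(92.5, 89), c(50, 50))

# Squared Euclidean distance from every point to every center:
# column j of d2 holds the distances to center j
d2 <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(x, 2, centers[j, ])^2)
})

# Each point is assigned to the center with the smallest distance
nearest <- apply(d2, 1, which.min)
nearest
```

Each row of d2 corresponds to one observation, so which.min picks the index of the closest center for that observation.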

I couldn’t make this prediction for all the models at once. To get all 100 necessary predictions, you would have to somehow vary the index in cluster$model[[1]], either in a tidy way or with a for loop.
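
For readers who want a starting point, one possible approach is sketched below. It is an assumption on my part, not something tested against the original data: it uses augment from broom, which attaches the fitted assignment of each row as a .cluster column, instead of cl_predict. The data frame toy is a hypothetical stand-in with the same column layout as dataset, and the seed is only there to make the random starts of kmeans reproducible.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)  # kmeans uses random starting centers

# Hypothetical stand-in for `dataset`: 4 groups of 25 rows, columns a, b, c
toy <- data.frame(
  a = sample(1:99, 100, replace = TRUE),
  b = factor(rep(paste0("group", 1:4), each = 25)),
  c = sample(1:99, 100, replace = TRUE)
)

assigned <- toy %>%
  nest(data = c(a, c)) %>%                            # one row per level of b
  mutate(model = map(data, kmeans, centers = 3),      # fit k-means per group
         augmented = map2(model, data, augment)) %>%  # add .cluster to each row
  select(b, augmented) %>%
  unnest(augmented)                                   # back to one row per observation

head(assigned)
```

The result has one row per original observation, with the group in b and the within-group cluster label in .cluster. The same labels are also stored in the cluster element of each kmeans fit.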

Another thing I don’t know how to do is to cluster the data with a different number of clusters per group. Here, 3 clusters were fitted in each of the 4 groups of the variable b. I don’t know if this would be a reasonable thing to do in practice.
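
As a rough sketch of how a per-group number of clusters might be passed, map2 from purrr can iterate over the nested data and a vector of cluster counts in parallel. The counts in ks and the data frame toy are hypothetical, chosen only for illustration. How to pick a sensible number per group is a separate question that this sketch does not address.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # kmeans uses random starting centers

# Hypothetical stand-in for `dataset`: 4 groups of 25 rows, columns a, b, c
toy <- data.frame(
  a = sample(1:99, 100, replace = TRUE),
  b = factor(rep(paste0("group", 1:4), each = 25)),
  c = sample(1:99, 100, replace = TRUE)
)

# Hypothetical number of clusters wanted in each group
ks <- c(group1 = 2, group2 = 3, group3 = 4, group4 = 3)

per_group <- toy %>%
  nest(data = c(a, c)) %>%
  mutate(k = unname(ks[as.character(b)]),                     # look up k by group name
         model = map2(data, k, ~ kmeans(.x, centers = .y)))   # fit with that k

# Number of centers actually fitted in each group
map_int(per_group$model, ~ nrow(.x$centers))
```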

But I will leave these two tasks to the reader :)
