How to group identifiers that are related across rows and columns in R

Hello. I’m working on database linkage and now have two columns of paired ids (id_a and id_b). Each pair represents the same individual found in different databases:

  id_a <- c(12,15,68663,34,34,34,20,1001) 
  id_b <- c(67764,68663,68667,14,19,1001,20,2112)
  input <- data.frame(id_a,id_b)

Thus, id 15 is related to id 68663 and also to id 68667 (via id 68663); these 3 grouped ids refer to the same individual.

I need this output:

output <- data.frame(id_linked_1 = c(12, 14, 15, 20),
                     id_linked_2 = c(67764, 19, 68663, 20),
                     id_linked_3 = c(NA, 34, 68667, NA),
                     id_linked_4 = c(NA, 1001, NA, NA),
                     id_linked_5 = c(NA, 2112, NA, NA))

That is, I need to group all the pairs that are related. The order of this grouping is not relevant to the analysis.

Example: 15//68663//68667 and 15//68667//68663 are the same.

Thank you.

1 answer

One solution uses the igraph package, which is designed for graph problems. The question maps naturally onto finding the connected components of a graph.
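To make the connected-components idea concrete, here is a minimal sketch that uses igraph's components() function directly. It is a shorter route than the step-by-step approach that follows, not the method of this answer; the names memb and groups are introduced here only for illustration, and the graph is built as undirected on the assumption that the direction of a pair carries no meaning.

```r
library(igraph)

# Example data from the question
id_a <- c(12, 15, 68663, 34, 34, 34, 20, 1001)
id_b <- c(67764, 68663, 68667, 14, 19, 1001, 20, 2112)
input <- data.frame(id_a, id_b)

# Undirected graph: each row of input becomes one edge
g <- graph_from_data_frame(input, directed = FALSE)

# components() assigns every vertex the number of its connected component
memb <- components(g)$membership

# One character vector of vertex names per component
groups <- unname(split(V(g)$name, memb))
groups
```

Each element of groups should hold the ids of one individual; the vertex names are character strings, so convert with as.numeric if numbers are needed.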

1. Load the igraph package and create a graph from the data frame.

library(igraph)

g <- graph_from_data_frame(input)
plot(g, vertex.size = 30, vertex.color = 'lightgrey', edge.arrow.width = 0.5)

2. Now determine the components. The function all_simple_paths is one option, but I will use subcomponent, calling it on each vertex that appears in the first column of input.

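# Vertex indices of the distinct ids in the first column
v_num <- unique(match(input[[1]], names(V(g))))
# For each of those vertices, collect every vertex connected to it.
# SIMPLIFY = FALSE keeps a list even if all components have the same size.
path_list <- mapply(subcomponent, list(g), v_num, SIMPLIFY = FALSE)
names(path_list) <- v_num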

3. Next, find out which components are unique. The code above returns one component for each distinct element of input$id_a, possibly with repetitions.
Note: the code below calls str_sort from another external package, stringr. It sorts the vertex names, which are of class "character", in numeric order. This is not strictly necessary for the final result; the order given by the base function sort would also do.

path_list2 <- lapply(path_list, function(p){
  p <- unlist(p, recursive = FALSE)
  stringr::str_sort(unique(names(p)), numeric = TRUE)
})

path_list2
#$`1`
#[1] "12"    "67764"
#
#$`2`
#[1] "15"    "68663" "68667"
#
#$`3`
#[1] "15"    "68663" "68667"
#
#$`4`
#[1] "14"   "19"   "34"   "1001" "2112"
#
#$`5`
#[1] "20"
#
#$`6`
#[1] "14"   "19"   "34"   "1001" "2112"
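The note about sorting can be illustrated with a small base R sketch. It assumes the vertex names are purely numeric strings, as in this example, and shows one way to get numeric order without stringr; the variable names ids and ids_sorted are made up for the illustration.

```r
# Vertex names are character strings, here deliberately shuffled
ids <- c("34", "1001", "14", "2112", "19")

# A plain character sort compares digit by digit, so "1001" lands before "14"
sort(ids)

# Base R alternative to stringr::str_sort(numeric = TRUE) for numeric labels:
# order by numeric value while keeping the character vector
ids_sorted <- ids[order(as.numeric(ids))]
ids_sorted
```

Either order works for the later steps, as long as every component is sorted the same way.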

4. These vectors are the components of the graph above, but some of them are repeated; keep only one copy of each.

final <- lapply(seq_along(path_list2), function(i){
  keep <- sapply(seq_along(path_list2)[-seq_len(i)], function(j){
    length(intersect(path_list2[[i]], path_list2[[j]])) == 0
  })
  if(all(keep)) path_list2[[i]] else NULL
})

final <- final[lengths(final) > 0]
final
#[[1]]
#[1] "12"    "67764"
#
#[[2]]
#[1] "15"    "68663" "68667"
#
#[[3]]
#[1] "20"
#
#[[4]]
#[1] "14"   "19"   "34"   "1001" "2112"

Now we only have one of each, which corresponds to the desired result.
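Because every component was sorted the same way in step 3, repeated components are identical vectors, so a shorter deduplication may be possible with base unique(). The sketch below is an alternative under that assumption, not the method used above; path_list2 is re-created from the output shown in step 3 so the block runs on its own, and final_alt is a name introduced here.

```r
# path_list2 as printed in step 3 (sorted vertex names, with repeats)
path_list2 <- list(
  `1` = c("12", "67764"),
  `2` = c("15", "68663", "68667"),
  `3` = c("15", "68663", "68667"),
  `4` = c("14", "19", "34", "1001", "2112"),
  `5` = "20",
  `6` = c("14", "19", "34", "1001", "2112")
)

# unique() on a list drops any element identical to an earlier one
final_alt <- unique(unname(path_list2))
final_alt
```

Unlike the intersect-based loop, which keeps the last copy of each component, unique() keeps the first, so the groups are the same but may come out in a different order.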

5. Finally, put the result in the format used in the question.

sapply(final, paste, collapse = "//")
#[1] "12//67764"              "15//68663//68667"       "20"                    
#[4] "14//19//34//1001//2112"
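If the wide data frame from the question (id_linked_1, id_linked_2, ...) is wanted instead of pasted strings, one possible base R sketch is to pad each group with NA and bind the groups as rows. Here final is re-created from the output of step 4 so the block is self-contained; n_max, padded and wide are names introduced for the illustration.

```r
# final as printed in step 4
final <- list(
  c("12", "67764"),
  c("15", "68663", "68667"),
  "20",
  c("14", "19", "34", "1001", "2112")
)

# Pad every group with NA up to the size of the largest group
n_max <- max(lengths(final))
padded <- lapply(final, function(x) {
  x <- as.numeric(x)
  length(x) <- n_max  # extending a vector fills the new slots with NA
  x
})

# One row per group, one column per linked id
wide <- as.data.frame(do.call(rbind, padded))
names(wide) <- paste0("id_linked_", seq_len(n_max))
wide
```

In this sketch the self-paired id 20 appears once in its row, whereas the sample output in the question lists it twice; the row order also follows final rather than the question's example.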
