Is there any way to eliminate duplicate elements that aren’t exactly the same?

dados1 <- c("10 ANOS DA POLÍTICA NACIONAL DE PROMOÇÃO DA SAÚDE: TRAJETÓRIAS E DESAFIOS", "4-CYCLOPROPYL-1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE AND ETHYL 1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE-4-CARBOXYLATE","7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF", "ABSCESSO DO MÚSCULO PSOAS ASSOCIADO À INFECÇÃO POR MYCOBACTERIUM TUBERCULOSIS EM PACIENTE COM AIDS", "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE","ABUNDANCE OF LUTZOMYIA LONGIPALPIS", "ABUSO E DEPENDÊNCIA DE DROGAS NA PERSPECTIVA DA SAÚDE PÚBLICA (EDITORIAL)")

qualis <- c("A2", "B3", "A1", "B2", "A2", "A2", "A1")

m <- data.frame("Título da Produção" = dados1,
                "Qualis" = qualis,
                "Ano" = c(2010:2016))

The data frame above is only illustrative. Note that the fifth and sixth elements of dados1 are practically the same title, but because they are not written identically I cannot use duplicated or unique.

Is there any other way to remove these rows, filtering by title?

1 answer

I wrote a function that can help you. It uses the stringdist package, which calculates distances between strings:

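combinar_textos_parecidos <- function(x, max_dist){
  x <- as.character(x)
  # Pairwise distances between all strings (requires the stringdist package)
  distancias <- stringdist::stringdistmatrix(x, x)
  # Note: if x is empty (e.g. NULL), 1:length(x) is c(1, 0) and the
  # indexing below fails with "subscript out of bounds"
  for(i in 1:length(x)){
    # Strings closer than max_dist to element i (element i itself included)
    small_dist <- distancias[i,] < max_dist
    if(sum(small_dist) > 1){
      # Overwrite the whole group with its first member
      x[small_dist] <- x[which(small_dist)[1]]
    }
  }
  return(x)
}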
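To make the mechanism concrete, here is a small sketch of the two pieces the function relies on: the pairwise distance matrix and the logical mask built from one of its rows. It assumes the stringdist package is installed; the three-title vector titulos is just a subset of the example data.

```r
# Sketch only; assumes the stringdist package is installed.
titulos <- c("ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE",
             "ABUNDANCE OF LUTZOMYIA LONGIPALPIS",
             "ABUSO E DEPENDÊNCIA DE DROGAS NA PERSPECTIVA DA SAÚDE PÚBLICA (EDITORIAL)")

# 3 x 3 matrix of pairwise distances (default method: optimal string alignment)
distancias <- stringdist::stringdistmatrix(titulos, titulos)

# Mask for row 1: TRUE for every title closer than the cut-off to the first one
perto <- distancias[1, ] < 10

# Positions of the titles that would be overwritten with titulos[1]
which(perto)
```

Only the first two titles fall under the cut-off, so they are the ones that end up with the same text.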

See what it returns when I apply it to your vector Título.da.Produção. Items 5 and 6 now have exactly the same name.

combinar_textos_parecidos(m$Título.da.Produção, 10)
[1] "10 ANOS DA POLÍTICA NACIONAL DE PROMOÇÃO DA SAÚDE: TRAJETÓRIAS E DESAFIOS"                                                                            
[2] "4-CYCLOPROPYL-1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE AND ETHYL 1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE-4-CARBOXYLATE"
[3] "7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF"                                                                                           
[4] "ABSCESSO DO MÚSCULO PSOAS ASSOCIADO À INFECÇÃO POR MYCOBACTERIUM TUBERCULOSIS EM PACIENTE COM AIDS"                                                   
[5] "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"                                                                                                             
[6] "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"                                                                                                             
[7] "ABUSO E DEPENDÊNCIA DE DROGAS NA PERSPECTIVA DA SAÚDE PÚBLICA (EDITORIAL)" 

So, running:

m$Título.da.Produção <- combinar_textos_parecidos(m$Título.da.Produção, 10)
m[!duplicated(m$Título.da.Produção),]

Row 6 would be dropped: duplicated flags the second occurrence of a repeated value, so row 5 is kept.

Note: I used a distance of 10 as the cut-off point. You may want to be more or less tolerant of how close the strings are. To do this, just adjust the max_dist parameter of my function.
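One thing to keep in mind when picking the cut-off: it is an absolute number of edits, so it behaves differently for long and short titles. The sketch below assumes stringdist is installed; the two short titles are hypothetical and not part of the question's data, and dividing by the title length is only one possible way to compare the cases, not something the function above does.

```r
# Sketch only; assumes stringdist is installed.
longo_a <- "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"
longo_b <- "ABUNDANCE OF LUTZOMYIA LONGIPALPIS"
curto_a <- "DENGUE"   # hypothetical short title
curto_b <- "ZIKA"     # hypothetical short title

# 6 edits: the trailing " TESTE" has to be inserted
d_longo <- stringdist::stringdist(longo_a, longo_b)
# Cannot exceed the length of the longer string, so it is at most 6
d_curto <- stringdist::stringdist(curto_a, curto_b)

d_longo < 10  # TRUE: near-duplicates, merged as intended
d_curto < 10  # TRUE as well: two unrelated short titles would also be merged

# Distance relative to the longer title's length separates the two cases
d_longo / max(nchar(longo_a), nchar(longo_b))
d_curto / max(nchar(curto_a), nchar(curto_b))
```

If your data contains short titles, a smaller max_dist is probably safer.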

You can read more about calculating distances here or by typing help("stringdist-metrics") in your R console.
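As a quick illustration of how the metric matters, the sketch below compares three of the methods stringdist offers on the two near-duplicate titles. It assumes the package is installed; note that my function uses the default method, so trying another one would mean passing method to stringdistmatrix yourself.

```r
# Sketch only; assumes stringdist is installed.
a <- "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"
b <- "ABUNDANCE OF LUTZOMYIA LONGIPALPIS"

# Edit-based metrics count operations, so they grow with the size of the difference
d_osa <- stringdist::stringdist(a, b, method = "osa")  # the default
d_lv  <- stringdist::stringdist(a, b, method = "lv")   # Levenshtein

# "jw" (Jaro, or Jaro-Winkler when p > 0) is scaled between 0 and 1,
# so a cut-off such as 10 would be meaningless with it
d_jw <- stringdist::stringdist(a, b, method = "jw")

c(osa = d_osa, lv = d_lv, jw = d_jw)
```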

  • Thank you very much.

  • dados$Título.da.Produção <- combinar_textos_parecidos(dados$Título.da.Produção, 10)
    Error in distancias[i, ] : subscript out of bounds
    Thanks, but I’m still trying to understand the problem.

  • This error is probably happening because you don’t have this column in your data frame. See what happens when I call the function on NULL:
    > combinar_textos_parecidos(NULL, 10)
    Error in distancias[i, ] : subscript out of bounds
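A quick way to check for this before calling the function is sketched below. The data frame dados and its column Titulo are hypothetical stand-ins for the commenter's data, which is not shown; the point is only that asking a data frame for a column it does not have returns NULL rather than raising an error.

```r
# Hypothetical data frame: the title column is named Titulo here,
# not Título.da.Produção
dados <- data.frame(Titulo = c("ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE",
                               "ABUNDANCE OF LUTZOMYIA LONGIPALPIS"),
                    stringsAsFactors = FALSE)

coluna <- "Título.da.Produção"

# A missing column comes back as NULL (dados$... behaves the same way)
valores <- dados[[coluna]]
is.null(valores)

# Guard: only go ahead if the column really exists
if (coluna %in% names(dados)) {
  message("Column found, safe to call combinar_textos_parecidos()")
} else {
  message("Column not found. Available columns: ",
          paste(names(dados), collapse = ", "))
}
```

Running names(dados) on your own data shows the exact spelling R is using for the column.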
