Similarity of Texts

Question

Asked 7 years, 7 months ago

Viewed 214 times

Good evening, everyone. I'm just starting out in R and would like some help. I need to flag the rows that contain similar phrases, and I'm using the stringdist package for that. However, I can only get a comparison that depends on the words being in the same position, and I'd like to measure the similarity of the whole sentence regardless of word order. For example, in the result below the third row is the same sentence as the first two, only with the words in a different order, so it should be treated as similar.

                           vet1        vet2        vet3
heber dos Santos araujo    0.0000000   0.0000000   0.3591486
heber dos Santos araujo    0.0000000   0.0000000   0.3591486
araujo Santos dos heber    0.3591486   0.3591486   0.0000000
heber dos s araujo         0.1372786   0.1372786   0.3955314

The code I’m using is:

library(stringdist)
library(dplyr)
library(tm)

dis <- read.csv2("C:/Users/heber.araujo/Desktop/Estudo Questões Duplicadas/exemploTeste.csv")

stp <- stopwords("portuguese")  # list of common words to remove

dis$Nome <- as.character(dis$Nome)  # column to search
dis$Nome <- removeWords(dis$Nome, stp)

# for(i in 1:nrow(dis)){
#   dis_2 <- strsplit(dis$text[i], " ")  # this command splits the sentence into words
#   dis_3 <- unlist(dis_2)

# dis_3 <- dis$GQUE_DS_ENUNCIADO

dis_3 <- dis$Nome

res <- stringdistmatrix(dis_3, dis_3, method = "jw")
rownames(res) <- dis_3

1 answer

by Rui Barradas • **15,422** points · Answer 1 · 2018-01-18T09:59:13+00:00

I believe the following code answers the question.
First I’ll read the data, since we don’t have access to the file exemploTeste.csv.

Nome <- scan(what = character(), text = "
'heber dos Santos araujo'
'heber dos Santos araujo'
'araujo Santos dos heber'
'heber dos s araujo'")

Now the distances are computed by a function, heber, which first sorts the words within each name and then calls stringdistmatrix on the sorted names, so differences in word order disappear.
The original strings are not changed.

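heber <- function(x, method = "jw"){
    # split each name into words on runs of whitespace
    y <- strsplit(x, "[[:space:]]+")
    # sort the words of each name and paste them back together;
    # vapply also handles names with different numbers of words
    y <- vapply(y, function(w) paste(sort(w), collapse = " "), character(1))
    stringdistmatrix(y, y, method = method)
}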
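
To see what the sorting step does on its own, here is a minimal sketch in base R only. The helper name sort_words is hypothetical (it is not part of stringdist), and trimws is added as an assumption so that leading or trailing blanks do not produce empty words.

```r
# Hypothetical helper: sort the words of each string so that
# word order no longer affects later comparisons.
sort_words <- function(x) {
  # trimws() avoids an empty first "word" when a string starts with a blank
  words <- strsplit(trimws(x), "[[:space:]]+")
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

a <- sort_words("heber dos Santos araujo")
b <- sort_words("araujo Santos dos heber")
identical(a, b)  # TRUE: both strings reduce to the same sorted key
```

Note that sort() follows the current locale's collation, so the exact order of the words in the key can vary between systems, but two strings with the same words always get the same key.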

Nome <- removeWords(Nome, stp)  # removeWords() and stp come from the question's code (package tm)
dis_3 <- Nome

res <- heber(dis_3)
rownames(res) <- dis_3
res
#                          [,1]      [,2]      [,3]      [,4]
#heber  Santos araujo 0.0000000 0.0000000 0.0000000 0.0877193
#heber  Santos araujo 0.0000000 0.0000000 0.0000000 0.0877193
#araujo Santos  heber 0.0000000 0.0000000 0.0000000 0.0877193
#heber  s araujo      0.0877193 0.0877193 0.0877193 0.0000000
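
Since the goal in the question is to flag the rows with similar phrases, one possible next step is to apply a cutoff to the distance matrix. The sketch below is only an illustration: the matrix is copied by hand from the output above so that it runs without stringdist, and the cutoff of 0.05 is an assumption that would need tuning on real data.

```r
# Distance matrix copied from the output above (4 names, method "jw").
res <- matrix(c(0,         0,         0,         0.0877193,
                0,         0,         0,         0.0877193,
                0,         0,         0,         0.0877193,
                0.0877193, 0.0877193, 0.0877193, 0),
              nrow = 4, byrow = TRUE)

threshold <- 0.05  # assumed cutoff, not from the original answer

# Keep each pair once (upper triangle) and only pairs at or below the cutoff.
similar <- which(res <= threshold & upper.tri(res), arr.ind = TRUE)
similar  # one line per similar pair: columns "row" and "col"

# Row numbers that take part in at least one similar pair.
flagged <- sort(unique(as.vector(similar)))
flagged
```

With this cutoff the first three names are flagged as similar to each other, while the fourth ("heber s araujo", distance 0.0877193) is not; raising the cutoff above that distance would include it.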