Good evening, everyone. I'm just starting out with R and would like some help. I need to flag the rows that contain similar phrases, and I'm using the stringdist library for that. However, the comparison I get depends on the words being in the same position, and I would like to measure the similarity of the whole sentence regardless of word order. For example, in the result below, the third row is the same sentence as the first two, only with the words in a different order, so it should be considered similar.
                         vet1       vet2       vet3
heber dos Santos araujo  0.0000000  0.0000000  0.3591486
heber dos Santos araujo  0.0000000  0.0000000  0.3591486
araujo Santos dos heber  0.3591486  0.3591486  0.0000000
heber dos s araujo       0.1372786  0.1372786  0.3955314
The code I’m using is:
library(stringdist)
library(dplyr)
dis<-read.csv2("C:/Users/heber.araujo/Desktop/Estudo Questões Duplicadas/exemploTeste.csv")
library(tm)
stp<-stopwords("portuguese") # list of common words (stopwords) to remove
dis$Nome<-as.character(dis$Nome) # column to search
dis$Nome<-removeWords(dis$Nome,stp)
# for(i in 1:nrow(dis)){
#   dis_2<-strsplit(dis$text[i]," ") # this command splits the sentence into words
#   dis_3<-unlist(dis_2)
# dis_3<-dis$GQUE_DS_ENUNCIADO
dis_3<-dis$Nome
res<-stringdistmatrix(dis_3,dis_3,method = "jw")
rownames(res)<-dis_3
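One possible way to make the comparison independent of word order is to normalize each phrase before measuring the distance: lowercase it, split it into words, sort the words, and join them back together. The sketch below is only an illustration under that assumption, not a definitive solution. The helper `sort_words` and the variables `nomes` and `chave` are hypothetical names, and the sample phrases are copied from the result table above instead of being read from the CSV file.

```r
library(stringdist)

# Hypothetical helper: lowercase each phrase, split it on whitespace,
# sort the words alphabetically and join them back with single spaces.
sort_words <- function(x) {
  words <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

# Sample phrases taken from the result table in the question.
nomes <- c("heber dos Santos araujo",
           "heber dos Santos araujo",
           "araujo Santos dos heber",
           "heber dos s araujo")

# Compare the normalized keys, but label the matrix with the original phrases.
chave <- sort_words(nomes)
res <- stringdistmatrix(chave, chave, method = "jw")
dimnames(res) <- list(nomes, nomes)
print(round(res, 3))
```

With this normalization the first three phrases map to the same key, so their pairwise Jaro-Winkler distance is 0, while the abbreviated fourth phrase still gets a small non-zero distance. Note that lowercasing also makes the comparison case-insensitive, which may or may not be what you want.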