There’s no magical way to do this. I’m going to propose a way that maybe can speed up your manual work, because at some point you’ll have to check and see which are well ranked or not.
With the following code:
combinar_textos_parecidos <- function(x, max_dist, method = "lv"){
x <- as.character(x)
distancias <- stringdist::stringdistmatrix(x, x, method = method)
for(i in 1:length(x)){
small_dist <- distancias[i,] < max_dist
if(sum(small_dist) > 1){
x[small_dist] <- x[which(small_dist)[1]]
}
}
return(x)
}
Copied from linked question and adapted to choose distance calculation method.
library(stringr)
textos %>% str_replace_all("-", " ") %>%
str_replace_all("[:space:]{1,}", " ") %>%
combinar_textos_parecidos(0.4, method = "jaccard")
[1] "ADASA/DF" "ADASA AGENCIA DE AGUAS" "ADASA/DF" "CASA CIVIL"
[5] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
[9] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
Thus, in your example you would need to change only 1 element of your vector.
Of course this is just an alternative. You could also create an auxiliary table with all possible companies and make a Join using the package fuzzyjoin. For some reason the package does not Join when one of the tables has only one column (I will open an Issue on github), but you can already use.
Example:
textos <- data.frame(emp = c("ADASA/DF", "ADASA-AGENCIA DE AGUAS", "ADASA-DF",
"CASA CIVIL", "CASA CIVIL - DF", "CASA CIVIL DA GOVERNADORIA DF", "CASA CIVIL DO DF", "CASA CIVIL DO DF.", "CASA CIVIL DO DIST. FEDERAL", "CASA CIVIL DO DISTRITO FEDERAL", "SEC. ESTADO DA CASA CIVIL"),
stringsAsFactors = F, a = 1
)
empresas <- data.frame(emp = c("ADASA", "CASA CIVIL"), emp2 = c("ADASA", "CASA CIVIL"), stringsAsFactors = F)
library(fuzzyjoin)
regex_left_join(textos, empresas, by = c(emp = "emp"))
emp.x a emp.y emp2
1 ADASA/DF 1 ADASA ADASA
2 ADASA-AGENCIA DE AGUAS 1 ADASA ADASA
3 ADASA-DF 1 ADASA ADASA
4 CASA CIVIL 1 CASA CIVIL CASA CIVIL
5 CASA CIVIL - DF 1 CASA CIVIL CASA CIVIL
6 CASA CIVIL DA GOVERNADORIA DF 1 CASA CIVIL CASA CIVIL
7 CASA CIVIL DO DF 1 CASA CIVIL CASA CIVIL
8 CASA CIVIL DO DF. 1 CASA CIVIL CASA CIVIL
9 CASA CIVIL DO DIST. FEDERAL 1 CASA CIVIL CASA CIVIL
10 CASA CIVIL DO DISTRITO FEDERAL 1 CASA CIVIL CASA CIVIL
11 SEC. ESTADO DA CASA CIVIL 1 CASA CIVIL CASA CIVIL
when you speak q is a beginner, are you a beginner in programming? have you ever thought of creating a conversion table using regular expressions?
– gzinho
Dear Mavir, I believe that what you want is not a programmable task without knowing exactly what information should be extracted from each row of this table. The possibilities are virtually infinite and perform a "DISTINCT" in this information, considering that it can be in any position of the text and written in several different ways seems to me an impossible task without a database of correct clients (which would already be the correction by itself), also because any regular expression as suggested above can generate false positives.
– Bruno Bermann
With the information given, I believe you should create a series of
palavras-chavethat should be found in this column, replacing everything else. Ex: I use the keywordCASA CIVIL, I search all the records containing it and replace the column value with it.– Marcelo de Andrade
When you say you have a "database", you are using some DBMS like Mysql, Oracle, etc... ?
– Marcelo de Andrade
I think an adaptation of the code of the following question might help: http://answall.com/a/146108/6036
– Daniel Falbel
Of course, this form is not as accurate as creating a keyword table.
– Daniel Falbel
@Danielfalbel, I saw this answer. But it would not apply to my case because the texts do not have a standard beginning for everyone as was the case with him. There is no standard distance I could use as a base.
– MaviR
It doesn’t use the beginning to calculate the distance, it can be things in the middle of the sentence...
– Daniel Falbel