There’s no magical way to do this. I’m going to propose a way that maybe can speed up your manual work, because at some point you’ll have to check and see which are well ranked or not.
With the following code:
combinar_textos_parecidos <- function(x, max_dist, method = "lv"){
x <- as.character(x)
distancias <- stringdist::stringdistmatrix(x, x, method = method)
for(i in 1:length(x)){
small_dist <- distancias[i,] < max_dist
if(sum(small_dist) > 1){
x[small_dist] <- x[which(small_dist)[1]]
}
}
return(x)
}
Copied from linked question and adapted to choose distance calculation method.
library(stringr)
textos %>% str_replace_all("-", " ") %>%
str_replace_all("[:space:]{1,}", " ") %>%
combinar_textos_parecidos(0.4, method = "jaccard")
[1] "ADASA/DF" "ADASA AGENCIA DE AGUAS" "ADASA/DF" "CASA CIVIL"
[5] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
[9] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
Thus, in your example you would need to change only 1 element of your vector.
Of course this is just an alternative. You could also create an auxiliary table with all possible companies and make a Join using the package fuzzyjoin
. For some reason the package does not Join when one of the tables has only one column (I will open an Issue on github), but you can already use.
Example:
textos <- data.frame(emp = c("ADASA/DF", "ADASA-AGENCIA DE AGUAS", "ADASA-DF",
"CASA CIVIL", "CASA CIVIL - DF", "CASA CIVIL DA GOVERNADORIA DF", "CASA CIVIL DO DF", "CASA CIVIL DO DF.", "CASA CIVIL DO DIST. FEDERAL", "CASA CIVIL DO DISTRITO FEDERAL", "SEC. ESTADO DA CASA CIVIL"),
stringsAsFactors = F, a = 1
)
empresas <- data.frame(emp = c("ADASA", "CASA CIVIL"), emp2 = c("ADASA", "CASA CIVIL"), stringsAsFactors = F)
library(fuzzyjoin)
regex_left_join(textos, empresas, by = c(emp = "emp"))
emp.x a emp.y emp2
1 ADASA/DF 1 ADASA ADASA
2 ADASA-AGENCIA DE AGUAS 1 ADASA ADASA
3 ADASA-DF 1 ADASA ADASA
4 CASA CIVIL 1 CASA CIVIL CASA CIVIL
5 CASA CIVIL - DF 1 CASA CIVIL CASA CIVIL
6 CASA CIVIL DA GOVERNADORIA DF 1 CASA CIVIL CASA CIVIL
7 CASA CIVIL DO DF 1 CASA CIVIL CASA CIVIL
8 CASA CIVIL DO DF. 1 CASA CIVIL CASA CIVIL
9 CASA CIVIL DO DIST. FEDERAL 1 CASA CIVIL CASA CIVIL
10 CASA CIVIL DO DISTRITO FEDERAL 1 CASA CIVIL CASA CIVIL
11 SEC. ESTADO DA CASA CIVIL 1 CASA CIVIL CASA CIVIL
when you speak q is a beginner, are you a beginner in programming? have you ever thought of creating a conversion table using regular expressions?
– gzinho
Dear Mavir, I believe that what you want is not a programmable task without knowing exactly what information should be extracted from each row of this table. The possibilities are virtually infinite and perform a "DISTINCT" in this information, considering that it can be in any position of the text and written in several different ways seems to me an impossible task without a database of correct clients (which would already be the correction by itself), also because any regular expression as suggested above can generate false positives.
– Bruno Bermann
With the information given, I believe you should create a series of
palavras-chave
that should be found in this column, replacing everything else. Ex: I use the keywordCASA CIVIL
, I search all the records containing it and replace the column value with it.– Marcelo de Andrade
When you say you have a "database", you are using some DBMS like Mysql, Oracle, etc... ?
– Marcelo de Andrade
I think an adaptation of the code of the following question might help: http://answall.com/a/146108/6036
– Daniel Falbel
Of course, this form is not as accurate as creating a keyword table.
– Daniel Falbel
@Danielfalbel, I saw this answer. But it would not apply to my case because the texts do not have a standard beginning for everyone as was the case with him. There is no standard distance I could use as a base.
– MaviR
It doesn’t use the beginning to calculate the distance, it can be things in the middle of the sentence...
– Daniel Falbel