Text Cleaning and Standardization

Asked

Viewed 108 times

3

I have an Excel table already in R with two columns (code and company name) and 22 thousand rows. The names have been filled freely, so the same name is written in different forms, abbreviated and in full.

I’ve been able to turn everything into a high box, but I need to turn the names into a single pattern, so I don’t repeat the same company with different descriptions.

Examples of situations of different descriptions for the same company:

  1. ADASA/DF, ADASA-AGENCIA DE AGUAS, ADASA-DF Would be transformed to ADASA;

  2. CASA CIVIL, CASA CIVIL - DF, CASA CIVIL DA GOVERNADORIA DF, CASA CIVIL DO DF, CASA CIVIL DO DF., CASA CIVIL DO DIST. FEDERAL, CASA CIVIL DO DISTRITO FEDERAL, SEC. ESTADO DA CASA CIVIL, etc... would be transformed to CASA CIVIL.

I searched for similar questions on the site, but found no answer.
Since I’m a beginner in programming and R, I may not be sure which is the best term to research.

After standardizing the text of the "company name" column, I need to get as a result all the codes associated with that company, without repeating the company name.

  • 1

    when you speak q is a beginner, are you a beginner in programming? have you ever thought of creating a conversion table using regular expressions?

  • 1

    Dear Mavir, I believe that what you want is not a programmable task without knowing exactly what information should be extracted from each row of this table. The possibilities are virtually infinite and perform a "DISTINCT" in this information, considering that it can be in any position of the text and written in several different ways seems to me an impossible task without a database of correct clients (which would already be the correction by itself), also because any regular expression as suggested above can generate false positives.

  • 1

    With the information given, I believe you should create a series of palavras-chave that should be found in this column, replacing everything else. Ex: I use the keyword CASA CIVIL, I search all the records containing it and replace the column value with it.

  • When you say you have a "database", you are using some DBMS like Mysql, Oracle, etc... ?

  • I think an adaptation of the code of the following question might help: http://answall.com/a/146108/6036

  • Of course, this form is not as accurate as creating a keyword table.

  • @Danielfalbel, I saw this answer. But it would not apply to my case because the texts do not have a standard beginning for everyone as was the case with him. There is no standard distance I could use as a base.

  • 1

    It doesn’t use the beginning to calculate the distance, it can be things in the middle of the sentence...

Show 3 more comments

1 answer

2


There’s no magical way to do this. I’m going to propose a way that maybe can speed up your manual work, because at some point you’ll have to check and see which are well ranked or not.

With the following code:

combinar_textos_parecidos <- function(x, max_dist, method = "lv"){
  x <- as.character(x)
  distancias <- stringdist::stringdistmatrix(x, x, method = method)
  for(i in 1:length(x)){
    small_dist <- distancias[i,] < max_dist
    if(sum(small_dist) > 1){
      x[small_dist] <- x[which(small_dist)[1]] 
    }
  }
  return(x)
}

Copied from linked question and adapted to choose distance calculation method.

library(stringr)
textos %>% str_replace_all("-", " ") %>%
  str_replace_all("[:space:]{1,}", " ") %>%
  combinar_textos_parecidos(0.4, method = "jaccard") 

 [1] "ADASA/DF"               "ADASA AGENCIA DE AGUAS" "ADASA/DF"               "CASA CIVIL"            
 [5] "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"            
 [9] "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"            

Thus, in your example you would need to change only 1 element of your vector.

Of course this is just an alternative. You could also create an auxiliary table with all possible companies and make a Join using the package fuzzyjoin. For some reason the package does not Join when one of the tables has only one column (I will open an Issue on github), but you can already use.

Example:

textos <- data.frame(emp = c("ADASA/DF",  "ADASA-AGENCIA DE AGUAS", "ADASA-DF",
                                          "CASA CIVIL", "CASA CIVIL - DF", "CASA CIVIL DA GOVERNADORIA DF", "CASA CIVIL DO DF", "CASA CIVIL DO DF.", "CASA CIVIL DO DIST. FEDERAL", "CASA CIVIL DO DISTRITO FEDERAL", "SEC. ESTADO DA CASA CIVIL"),
                     stringsAsFactors = F, a = 1
)

empresas <- data.frame(emp = c("ADASA", "CASA CIVIL"), emp2 = c("ADASA", "CASA CIVIL"), stringsAsFactors = F)
library(fuzzyjoin)
regex_left_join(textos, empresas, by = c(emp = "emp"))

                            emp.x a      emp.y       emp2
1                        ADASA/DF 1      ADASA      ADASA
2          ADASA-AGENCIA DE AGUAS 1      ADASA      ADASA
3                        ADASA-DF 1      ADASA      ADASA
4                      CASA CIVIL 1 CASA CIVIL CASA CIVIL
5                 CASA CIVIL - DF 1 CASA CIVIL CASA CIVIL
6   CASA CIVIL DA GOVERNADORIA DF 1 CASA CIVIL CASA CIVIL
7                CASA CIVIL DO DF 1 CASA CIVIL CASA CIVIL
8               CASA CIVIL DO DF. 1 CASA CIVIL CASA CIVIL
9     CASA CIVIL DO DIST. FEDERAL 1 CASA CIVIL CASA CIVIL
10 CASA CIVIL DO DISTRITO FEDERAL 1 CASA CIVIL CASA CIVIL
11      SEC. ESTADO DA CASA CIVIL 1 CASA CIVIL CASA CIVIL

Browser other questions tagged

You are not signed in. Login or sign up in order to post.