In R, correct typing errors followed by a database processing

Asked

Viewed 391 times

7

This problem is quite complex and needs two stages. The first stage consists of correcting typing errors in a database (perhaps a probabilistic solution). The second stage is to tidy up this database after this correction. This second stage requires a sequence of applications from the dplyr package (or another appropriate and elegant package)

Let’s go to the first stage. I have a company database. The database provided does not fully reveal the identity of the worker. I will illustrate the basis and then explain the variables.

data <- read.table(text="
              cpf;nome;m1;m2;m3;m4;m5;m6;m7;m8;m9;m10;m11;m12;salario
              100001;Maria dos Santos Magalhães;1;0;0;0;0;0;0;0;1;0;0;0;1234
              100001;Maria Santos Magalhães;0;1;1;1;1;1;1;1;0;1;1;1;1034
              100002;Lucas Barbosa;1;1;1;1;1;1;1;1;1;1;1;1;4234 
              100002;Danilo Carvalho;1;1;1;1;1;1;1;1;1;1;1;0;7234
              100003;Paulo Silva de Fonseca;0;1;1;1;1;1;1;1;1;1;1;0;1254
              100003;Paulo Silva da Fonseca;0;0;0;0;0;0;0;0;0;0;0;1;2234
              100003;Wagner Silva Junior;1;1;1;0;0;0;0;0;0;0;0;0;4234
              100003;Paulo Silva Fonseca;1;0;0;0;0;0;0;0;0;0;0;0;1232
              100004;Ricardo Colho;1;1;1;1;1;1;1;0;1;1;1;0;5234
              100004;Ricardo Coelho;0;0;0;0;0;0;0;1;0;0;0;1;1234", h=T, sep=";")

Explaining the variables. First, we don’t have the complete Cpf, we only have the 6 middle numbers. The variable "name" needs no explanation. The variables of type M1,m2,m3, etc., are the months. These variables are binary and 1 represents that the worker worked in the month in question and 0 who did not work. The variable "salary" is the value that the worker earned in the messes worked. The data presented here are fictitious.

First thing to look at every set of cpfs is that there are typos. For example, the group whose middle Cpf number is 100001, we have a great chance that Maria dos Santos Magalhães and Maria Santos Magalhães are the same person. Another evidence is that if it were two different people, they would probably have months of work in common, as is the case of Cpf 100002, where Lucas Barbosa and Danilo Carvalho are different people. Other cases follow the same explanation.

I need some kind of algorithm to tell me, for example, that Maria dos Santos Magalhães and Maria Santos Magalhães are, as high probability, the same person. Just like Lucas Barbosa and Danilo Carvalho are practically different people.

An attempt using adist:

teste<- data[data$cpf == 100003 , ]
(ch1<- teste$nome) 
[1] Paulo Silva de Fonseca Paulo Silva da Fonseca Wagner Silva Junior   
[4] Paulo Silva Fonseca   
10 Levels: Danilo Carvalho Lucas Barbosa ... Wagner Silva Junior

(d1 <- ch1 %>% adist())
         [,1] [,2] [,3] [,4]
[1,]        0    1   14    3
[2,]        1    0   14    3
[3,]       14   14    0   11
[4,]        3    3   11    0

I will delete those that have zero distance and less than 5 as default. But first I will name the rows and columns.

(d1<- as.data.frame(d1)) 
names(d1)<- ch1
row.names(d1)<- ch1
thresh=5
(teste<- which(d1 != 0 & d1 < thresh, arr.ind=TRUE) )
                        row col
 Paulo Silva da Fonseca   2   1
 Paulo Silva Fonseca      4   1
 Paulo Silva de Fonseca   1   2
 Paulo Silva Fonseca      4   2
 Paulo Silva de Fonseca   1   4
 Paulo Silva da Fonseca   2   4

Note that in this particular case, Wagner Silva Junior has no connection with the others. From now on, the second stage begins: With this matrix of distances, I would like to do a series of manipulations in order to tidy up the names, the months worked and the salary. In short, I would like something like this:

      cpf                       nome m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 salario

2  100001     Maria Santos Magalhães  1  1  1  1  1  1  1  1  1   1   1   1    2268
3  100002              Lucas Barbosa  1  1  1  1  1  1  1  1  1   1   1   1    4234
4  100002            Danilo Carvalho  1  1  1  1  1  1  1  1  1   1   1   0    7234
5  100003     Paulo Silva de Fonseca  1  1  1  1  1  1  1  1  1   1   1   1    4720
7  100003        Wagner Silva Junior  1  1  1  0  0  0  0  0  0   0   0   0    4234
9  100004              Ricardo Colho  1  1  1  1  1  1  1  1  1   1   1   1    6468

I believe that a number of functions using dplyr can solve this second stage

  • managed to solve?

  • @Flaviobarros , No

  • 1

    @orrillo still interested in the solution? because I have something in mind, and I can develop the answer!

  • You can post. It is always good to contribute

1 answer

4

Introducing

I had a similar problem to this in 2017, and I solved it from the Fuzzy logic, based on this link 2016.

In the link, the author explains that he takes each sentence (which in this case is a full name), separates by word and uses the distance from Levenshtein to calculate how far one word is from the other, and finally, ponder by the size of the words (briefly this is it).

In the end, the code results in a Fuzzy score of word matching (the larger the better, and +Inf implies an exact match).

Correct typos

The functions are loaded:

############################################################################ ###
### FUZZY MATCHING FUNCTIONS                                                  ####
############################################################################ ###
fuzzy_prep_words <- function(words) {
  # Prepares a list of words for fuzzy matching. All the other fuzzy matching
  # functions will run word through this. Given a list of sentences, returns
  # a list of words.

  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", words)), "\\W+"))
  return(words)
}

fuzzy_gen_word_freq <- function(l, fun = identity) {
  # Returns a word frequency vector based on vector of sentences l and with
  # frequencies post-processed by fun (e.g. log)

  fun(sort(table(fuzzy_prep_words(unlist(strsplit(l, ' ')))), decreasing=T))+1
}

fuzzy_title_match <- function(a, b, wf) {
  # Fuzzy matches a performance title based on a custom algorithm tuned for
  # this purpose. Words are frequency-weighted (like tf-idf).
  # 
  # Args:
  #   a, b: the two titles to match
  #   wf: a vector of word frequencies as generated by fuzzy_gen_word_freq
  #
  # Returns:
  #   A fuzzy match score, higher is better, +Inf for exact match

  if (a == b) # Shortcut to make faster
    return (Inf)
  a.words <- fuzzy_prep_words(a)
  b.words <- fuzzy_prep_words(b)
  a.freqs <- sapply(a.words, function(x) { ifelse(is.na(wf[x]), 1, wf[x]) })
  b.freqs <- sapply(b.words, function(x) { ifelse(is.na(wf[x]), 1, wf[x]) })

  d <- adist(a.words, b.words)
  a.matches <- 1-apply(d, 1, function(x) { min(x, 2) })/2
  b.matches <- 1-apply(d, 2, function(x) { min(x, 2) })/2

  matchsum <- min(sum(a.matches * 1/a.freqs), sum(b.matches * 1/b.freqs))
  unmatchsum <- sum(floor(1-a.matches) * 1/a.freqs) + sum(floor(1-b.matches) * 1/b.freqs)
  return(matchsum / unmatchsum)
}

Names you want to calculate the score Fuzzy (in which case, it’s the same name):

A <- as.character(data$nome)
B <- as.character(data$nome)

Apply on the job:

wf <- fuzzy_gen_word_freq(c(A, B))
vectorised_match <- function (L1,L2) { mapply(function(a,b) { fuzzy_title_match(a, b, wf) }, L1, L2) } 
scores <- outer(A, B, vectorised_match)
rownames(scores) <- A
colnames(scores) <- B

Returns an array with the Fuzzy score of the distance between each of the names:

> scores
                           Maria dos Santos Magalhães Maria Santos Magalhães Lucas Barbosa Danilo Carvalho Paulo Silva de Fonseca Paulo Silva da Fonseca Wagner Silva Junior
Maria dos Santos Magalhães                        Inf                    1.8             0               0             0.00000000             0.00000000          0.00000000
Maria Santos Magalhães                            1.8                    Inf             0               0             0.00000000             0.00000000          0.00000000
Lucas Barbosa                                     0.0                    0.0           Inf               0             0.00000000             0.00000000          0.00000000
Danilo Carvalho                                   0.0                    0.0             0             Inf             0.00000000             0.00000000          0.00000000
Paulo Silva de Fonseca                            0.0                    0.0             0               0                    Inf                    Inf          0.08641975
Paulo Silva da Fonseca                            0.0                    0.0             0               0                    Inf                    Inf          0.08641975
Wagner Silva Junior                               0.0                    0.0             0               0             0.08641975             0.08641975                 Inf
Paulo Silva Fonseca                               0.0                    0.0             0               0             1.19047619             1.19047619          0.11666667
Ricardo Colho                                     0.0                    0.0             0               0             0.00000000             0.00000000          0.00000000
Ricardo Coelho                                    0.0                    0.0             0               0             0.00000000             0.00000000          0.00000000
                           Paulo Silva Fonseca Ricardo Colho Ricardo Coelho
Maria dos Santos Magalhães           0.0000000             0              0
Maria Santos Magalhães               0.0000000             0              0
Lucas Barbosa                        0.0000000             0              0
Danilo Carvalho                      0.0000000             0              0
Paulo Silva de Fonseca               1.1904762             0              0
Paulo Silva da Fonseca               1.1904762             0              0
Wagner Silva Junior                  0.1166667             0              0
Paulo Silva Fonseca                        Inf             0              0
Ricardo Colho                        0.0000000           Inf            Inf
Ricardo Coelho                       0.0000000           Inf            Inf

In theory, only, this should work, but not the case:

A <- as.data.frame(A)
names(A) <- c("nome")
A$Nome1 <- colnames(scores)[apply(scores,1,which.max)]
A
                         nome                      Nome1
1  Maria dos Santos Magalhães Maria dos Santos Magalhães
2      Maria Santos Magalhães     Maria Santos Magalhães
3               Lucas Barbosa              Lucas Barbosa
4             Danilo Carvalho            Danilo Carvalho
5      Paulo Silva de Fonseca     Paulo Silva de Fonseca
6      Paulo Silva da Fonseca     Paulo Silva de Fonseca
7         Wagner Silva Junior        Wagner Silva Junior
8         Paulo Silva Fonseca        Paulo Silva Fonseca
9               Ricardo Colho              Ricardo Colho
10             Ricardo Coelho              Ricardo Colho

But as can be seen, it didn’t work for the "Marias" and the "Paulo", so more manipulations are needed in this distance matrix (the logic Fuzzy has already stopped here, from now on, is a way I thought to solve).

I basically take the score and put it in the table of names, make a table from the Score, filter those with values greater than 1, and assign the names (from the table of names) to the scores:

scores[is.infinite(scores)] <- NA # Faço isso, pois a correspondência exata de nomes já foi atribuída
A$Pontuacao <- as.numeric(apply(scores,1,max, na.rm = T))
tab <- table(A$Pontuacao)
# O ponto de corte a ser definido foi "1", mas isso pode ser melhor estudado a partir de uma amostra maior
aux <- tab[tab>1]
aux <- aux[names(aux)!="0"]
aux <- as.data.frame(aux)
aux <- aux[, -2, drop = F]
names(aux) <- "Pontuacao"
aux <- merge(aux, A[, c(2,3)], all.x = T)

Here, from the original name table, I merge with the table I created, and reassign the names from ifelse:

aux <- aux[!duplicated(aux$Pontuacao), ]
aux$Pontuacao <- as.character(aux$Pontuacao)
A$Pontuacao <- as.character(A$Pontuacao)
names(aux)[2] <- "NomeNovo"
A <- merge(A, aux, all.x = T)
A$NomeNovo <- ifelse(is.na(A$NomeNovo), as.character(A$Nome1), as.character(A$NomeNovo))

           Pontuacao                       nome                      Nome1                   NomeNovo
1                  0              Ricardo Colho              Ricardo Colho              Ricardo Colho
2                  0             Ricardo Coelho              Ricardo Colho              Ricardo Colho
3                  0              Lucas Barbosa              Lucas Barbosa              Lucas Barbosa
4                  0            Danilo Carvalho            Danilo Carvalho            Danilo Carvalho
5  0.116666666666667        Wagner Silva Junior        Wagner Silva Junior        Wagner Silva Junior
6   1.19047619047619        Paulo Silva Fonseca        Paulo Silva Fonseca     Paulo Silva de Fonseca
7   1.19047619047619     Paulo Silva de Fonseca     Paulo Silva de Fonseca     Paulo Silva de Fonseca
8   1.19047619047619     Paulo Silva da Fonseca     Paulo Silva de Fonseca     Paulo Silva de Fonseca
9                1.8 Maria dos Santos Magalhães Maria dos Santos Magalhães Maria dos Santos Magalhães
10               1.8     Maria Santos Magalhães     Maria Santos Magalhães Maria dos Santos Magalhães

Processing of a database

Now it’s easy! Just merge the original data, with the table of names sorted, and add the columns grouping by name and CPF:

data2 <- merge(data, A[, c("nome", "NomeNovo")])
# Somo os salários e os meses pelo CPF e Nome
data2 <- aggregate(. ~ NomeNovo + cpf, data2[, -c(1)], sum)

# Aqui, é possível fazer um teste se algum nome foi agrupado de forma indevida
data2 <- data2[rowSums(data2[, -c(1:2, ncol(data2))])<=12, ]
# Como Não foi, eu paro por aqui, mas outras melhorias poderiam ser feitas
data2
                    NomeNovo    cpf m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 salario
1 Maria dos Santos Magalhães 100001  1  1  1  1  1  1  1  1  1   1   1   1    2268
2            Danilo Carvalho 100002  1  1  1  1  1  1  1  1  1   1   1   0    7234
3              Lucas Barbosa 100002  1  1  1  1  1  1  1  1  1   1   1   1    4234
4     Paulo Silva de Fonseca 100003  1  1  1  1  1  1  1  1  1   1   1   1    4720
5        Wagner Silva Junior 100003  1  1  1  0  0  0  0  0  0   0   0   0    4234
6              Ricardo Colho 100004  1  1  1  1  1  1  1  1  1   1   1   1    6468

Browser other questions tagged

You are not signed in. Login or sign up in order to post.