This problem is fairly complex and has two stages. The first stage is correcting typing errors in a database (perhaps with a probabilistic approach). The second stage is tidying up the database after that correction, which requires a sequence of operations from the dplyr package (or another suitable, elegant package).
Let's start with the first stage. I have a company database that does not fully reveal each worker's identity. I will show the data first and then explain the variables.
data <- read.table(text="
cpf;nome;m1;m2;m3;m4;m5;m6;m7;m8;m9;m10;m11;m12;salario
100001;Maria dos Santos Magalhães;1;0;0;0;0;0;0;0;1;0;0;0;1234
100001;Maria Santos Magalhães;0;1;1;1;1;1;1;1;0;1;1;1;1034
100002;Lucas Barbosa;1;1;1;1;1;1;1;1;1;1;1;1;4234
100002;Danilo Carvalho;1;1;1;1;1;1;1;1;1;1;1;0;7234
100003;Paulo Silva de Fonseca;0;1;1;1;1;1;1;1;1;1;1;0;1254
100003;Paulo Silva da Fonseca;0;0;0;0;0;0;0;0;0;0;0;1;2234
100003;Wagner Silva Junior;1;1;1;0;0;0;0;0;0;0;0;0;4234
100003;Paulo Silva Fonseca;1;0;0;0;0;0;0;0;0;0;0;0;1232
100004;Ricardo Colho;1;1;1;1;1;1;1;0;1;1;1;0;5234
100004;Ricardo Coelho;0;0;0;0;0;0;0;1;0;0;0;1;1234", header=TRUE, sep=";")
Explaining the variables: we do not have the complete CPF, only its 6 middle digits (column cpf). The variable nome (name) needs no explanation. The variables m1, m2, m3, ..., m12 are the months; they are binary, where 1 means the worker worked in that month and 0 means the worker did not. The variable salario (salary) is the amount the worker earned in the months worked. The data presented here are fictitious.
The first thing to notice within each group of CPFs is that there are typos. For example, in the group whose middle CPF digits are 100001, it is very likely that Maria dos Santos Magalhães and Maria Santos Magalhães are the same person. A further piece of evidence is that two different people would probably have months of work in common, as with CPF 100002, where Lucas Barbosa and Danilo Carvalho are different people; the two Maria records share no month. The other cases follow the same reasoning.
I need some kind of algorithm that tells me, for example, that Maria dos Santos Magalhães and Maria Santos Magalhães are, with high probability, the same person, and that Lucas Barbosa and Danilo Carvalho are almost certainly different people.
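The month-overlap evidence described above can be checked directly. The following is only a minimal sketch: the helper name shared_months is hypothetical, and the month vectors are copied by hand from the rows for CPFs 100001 and 100002.

```r
# Month indicators m1..m12, copied from the two rows with cpf 100001
maria_1 <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
maria_2 <- c(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)

# Month indicators m1..m12, copied from the two rows with cpf 100002
lucas  <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
danilo <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)

# Hypothetical helper: number of months in which both records report work
shared_months <- function(a, b) sum(a == 1 & b == 1)

shared_months(maria_1, maria_2)  # 0: no month in common
shared_months(lucas, danilo)     # 11: heavy overlap
```

A count of zero does not prove that two records belong to the same person, but combined with a small name distance it supports merging them.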
An attempt using adist:
library(dplyr)  # provides the %>% pipe used below
teste <- data[data$cpf == 100003, ]
(ch1<- teste$nome)
[1] Paulo Silva de Fonseca Paulo Silva da Fonseca Wagner Silva Junior
[4] Paulo Silva Fonseca
10 Levels: Danilo Carvalho Lucas Barbosa ... Wagner Silva Junior
(d1 <- ch1 %>% adist())
     [,1] [,2] [,3] [,4]
[1,]    0    1   14    3
[2,]    1    0   14    3
[3,]   14   14    0   11
[4,]    3    3   11    0
I will select the pairs whose distance is non-zero and below a default threshold of 5. But first I will name the rows and columns.
(d1<- as.data.frame(d1))
names(d1)<- ch1
row.names(d1)<- ch1
thresh=5
(teste<- which(d1 != 0 & d1 < thresh, arr.ind=TRUE) )
                       row col
Paulo Silva da Fonseca   2   1
Paulo Silva Fonseca      4   1
Paulo Silva de Fonseca   1   2
Paulo Silva Fonseca      4   2
Paulo Silva de Fonseca   1   4
Paulo Silva da Fonseca   2   4
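One possible way to turn these pairwise matches into groups of names is single-linkage hierarchical clustering on the same distance matrix, cut at the threshold. This is a sketch under assumptions, not an established solution to this problem: the choice of hclust with method = "single" is mine, and the cut height of 4 mirrors "distance below 5" because adist returns whole numbers.

```r
# The four names recorded under cpf 100003
nomes <- c("Paulo Silva de Fonseca", "Paulo Silva da Fonseca",
           "Wagner Silva Junior", "Paulo Silva Fonseca")

# Levenshtein distance matrix between the names (adist is in utils)
d <- adist(nomes)

# Single linkage: a name joins a group if it is close to ANY member of it
hc <- hclust(as.dist(d), method = "single")

# Keep only merges at distance <= 4, i.e. below the threshold of 5
grupo <- cutree(hc, h = 4)

setNames(grupo, nomes)
```

With these names, the three Paulo variants receive the same group label and Wagner Silva Junior gets a label of his own. Single linkage can chain together names that are only indirectly similar, so the threshold deserves some care on real data.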
Note that in this particular case Wagner Silva Junior has no connection with the others. This is where the second stage begins: starting from this distance matrix, I would like to perform a series of manipulations to tidy up the names, the months worked and the salary. In short, I would like something like this:
     cpf                   nome m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 salario
2 100001 Maria Santos Magalhães  1  1  1  1  1  1  1  1  1   1   1   1    2268
3 100002          Lucas Barbosa  1  1  1  1  1  1  1  1  1   1   1   1    4234
4 100002        Danilo Carvalho  1  1  1  1  1  1  1  1  1   1   1   0    7234
5 100003 Paulo Silva de Fonseca  1  1  1  1  1  1  1  1  1   1   1   1    4720
7 100003    Wagner Silva Junior  1  1  1  0  0  0  0  0  0   0   0   0    4234
9 100004          Ricardo Colho  1  1  1  1  1  1  1  1  1   1   1   1    6468
I believe a sequence of dplyr functions can solve this second stage.
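As a rough sketch of what that second stage might look like, the block below groups similar names within each CPF and then collapses each group with dplyr (version 1.0 or later, for across and .groups). Several things here are assumptions of mine rather than part of the question: the helper agrupar, the use of single-linkage clustering with a cut at distance 4, keeping the first spelling found as the group's name, taking the maximum of each month column, and summing salario.

```r
library(dplyr)

data <- read.table(text = "
cpf;nome;m1;m2;m3;m4;m5;m6;m7;m8;m9;m10;m11;m12;salario
100001;Maria dos Santos Magalhães;1;0;0;0;0;0;0;0;1;0;0;0;1234
100001;Maria Santos Magalhães;0;1;1;1;1;1;1;1;0;1;1;1;1034
100002;Lucas Barbosa;1;1;1;1;1;1;1;1;1;1;1;1;4234
100002;Danilo Carvalho;1;1;1;1;1;1;1;1;1;1;1;0;7234
100003;Paulo Silva de Fonseca;0;1;1;1;1;1;1;1;1;1;1;0;1254
100003;Paulo Silva da Fonseca;0;0;0;0;0;0;0;0;0;0;0;1;2234
100003;Wagner Silva Junior;1;1;1;0;0;0;0;0;0;0;0;0;4234
100003;Paulo Silva Fonseca;1;0;0;0;0;0;0;0;0;0;0;0;1232
100004;Ricardo Colho;1;1;1;1;1;1;1;0;1;1;1;0;5234
100004;Ricardo Coelho;0;0;0;0;0;0;0;1;0;0;0;1;1234",
  header = TRUE, sep = ";", stringsAsFactors = FALSE)

# Hypothetical helper: label names that are within edit distance h of
# each other (single linkage); hclust needs at least two items
agrupar <- function(nomes, h = 4) {
  if (length(nomes) < 2) return(rep(1L, length(nomes)))
  cutree(hclust(as.dist(adist(nomes)), method = "single"), h = h)
}

resultado <- data %>%
  group_by(cpf) %>%
  mutate(grupo = agrupar(nome)) %>%      # cluster names inside each cpf
  group_by(cpf, grupo) %>%
  summarise(nome = first(nome),          # keep the first spelling found
            across(m1:m12, max),         # worked if any record says so
            salario = sum(salario),      # add up the salaries
            .groups = "drop") %>%
  select(-grupo)

resultado
```

On the sample data this yields six rows whose months and salaries agree with the desired table. The name kept for a group may differ from the one shown there (for example, the first spelling for CPF 100001 is Maria dos Santos Magalhães), since the question does not say which spelling should win.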
Did you manage to solve it?
– Flavio Barros
@Flaviobarros, no.
– orrillo
@orrillo are you still interested in a solution? I have something in mind and can write up an answer!
– Guilherme Parreira
You can post it. It is always good to contribute.
– orrillo