I had a similar problem at my company. We basically use two techniques to identify homonyms whose names contain typos.
The first is the Levenshtein distance and the second is the soundex() function.
LEVENSHTEIN
Paraphrasing Wikipedia: "the Levenshtein distance, or edit distance, between two strings is the minimum number of operations needed to transform one string into the other." The operations are single-character insertions, deletions and substitutions.
For example, the Levenshtein distance between "Guilherme Silva" and "Guilherme da Silva" is 3 (the three inserted characters "da "). Between "Maria Dores" and "Maria das Dores" it is 4.
Here is the code to create the levenshtein() function in MySQL:
DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    -- cv1 holds the previous row of the distance matrix and cv0 the row being built, one byte per cell
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        -- first row: distance from the empty string to each prefix of s2 (0, 1, 2, ...)
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                -- candidate 1: cell to the left + 1
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0;
                ELSE
                    SET cost = 1;
                END IF;
                -- candidate 2: diagonal cell + substitution cost
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                -- candidate 3: cell above + 1
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    RETURN c;
END$$
DELIMITER ;
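Once the function has been created, a quick sanity check is to run it on the two pairs quoted earlier and compare the results with the hand-computed distances (3 and 4). This is only a sketch: it assumes the function exists in the current schema under the name levenshtein, and the column aliases are made up for the example. If your server returns different numbers, the character set or collation used for the comparisons is worth checking first.

```sql
-- Sanity check: compare the results with the distances quoted in the text (3 and 4).
SELECT
    levenshtein('Guilherme Silva', 'Guilherme da Silva') AS guilherme_distance,
    levenshtein('Maria Dores', 'Maria das Dores')        AS maria_distance;
```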
In your case, you can use it in a query like this:
SELECT * FROM usuario AS us WHERE levenshtein(us.nome, 'Wallace Silva') < 5;
-- Or another threshold. I used 5 so that only names with a Levenshtein distance of at most 4 are matched.
If the database contains the following records:
--------------------
1 Wallace Silva
2 Wallace João da Silva
3 Wallace das Silva
4 Guilherme da Silva
--------------------
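The query will return, in this case, only records 1 and 3 (distances 0 and 4). Record 2 is 8 characters longer than the search term, so its distance is at least 8, and record 4 differs by more than 4 edits.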
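To see why a given row passes or fails the threshold, it can help to list the computed distance next to each name instead of filtering on it. The query below is a minimal sketch of that idea; the `id` column and the `distancia` alias are assumptions made for the example, not something taken from your schema.

```sql
-- Assumes a hypothetical `id` column next to `nome` in `usuario`.
SELECT us.id,
       us.nome,
       levenshtein(us.nome, 'Wallace Silva') AS distancia
FROM usuario AS us
ORDER BY distancia;
```

Looking at the distances for real data is usually the easiest way to pick a threshold that fits your names.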
SOUNDEX
The soundex() function can be used in database searches where you know how a word is pronounced but not exactly how it is written.
soundex() is native to MySQL, so we don't need to create it manually.
In your case, the query would look something like this:
SELECT * FROM pessoa WHERE soundex(nome) = soundex('Wallace Silva')
The problem with soundex() is that it was designed for English words, so it does not work 100% for Portuguese.
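Before relying on soundex() for Portuguese names, you can inspect the codes it produces for a few spellings you expect to match. This is just an illustrative sketch: 'Walace Sylva' is a misspelling I made up, and whether the two codes come out equal is exactly what the query lets you check on your own server.

```sql
-- Compare the codes for two spellings; 'Walace Sylva' is a made-up misspelling.
SELECT SOUNDEX('Wallace Silva') AS original_code,
       SOUNDEX('Walace Sylva')  AS misspelled_code;
```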
COMBINED MODE
We can also combine the two search methods in a single query to increase the chance of finding a match.
We can do it like this:
SELECT * FROM pessoa WHERE levenshtein(soundex(nome), soundex('Wallace Silva')) < 3
Or alternatively:
SELECT * FROM pessoa WHERE (levenshtein(nome, 'Wallace Silva') < 5) OR soundex(nome) = soundex('Wallace Silva')
SOME CONSIDERATIONS
- The more functions we use, the more processing the machine has to do. In other words, if we create a 'monster' query, it may take a long time to run.
- Both functions have drawbacks. I recommend studying them well and testing thoroughly before putting them into any code.
- So far, these have been the most effective ways I have found to search for similar names directly in an SQL query, and they have served me well. That doesn't mean there aren't better ones ;)
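On the processing cost, one possible way to reduce the number of levenshtein() calls is to add a cheap condition that rules rows out early. The sketch below relies on the fact that the difference in length between two strings is a lower bound on their Levenshtein distance. Treat it as an idea to test rather than a guaranteed speed-up: MySQL is not obliged to evaluate the cheap condition first, so measure it on your own data.

```sql
-- The length difference of two strings can never exceed their Levenshtein distance,
-- so a row that fails the first condition could not pass the second one anyway.
SELECT *
FROM usuario AS us
WHERE ABS(CHAR_LENGTH(us.nome) - CHAR_LENGTH('Wallace Silva')) < 5
  AND levenshtein(us.nome, 'Wallace Silva') < 5;
```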
Anyway, I hope I helped! Enjoy!
In my case, I need the word order stored in the database to be identical to the one passed in, just disregarding the words de, da, dos and das in the database. So "Luiz Silva" in the query should return "Luiz da Silva" from the database. Understood? – Wallace Maxters
And if the database had "Luiz da Silva" and also "Luiz da Silva Santos" and "Luiz da Silva dos Santos", should all 3 be returned? – Felippe Tadeu