How can I write a query with LIKE or REGEXP that ignores certain words stored in the table?

I have the following data in a table

usuarios
------------------------
nome
------------------------
Wallace de Souza Vizerra
Gustavo Carmo da Costa

I need to return the records from the usuarios table whose nome field contains a given value, while ignoring certain words present in the stored nome values.

Example:

SELECT * FROM usuarios WHERE nome LIKE 'Wallace Souza Vizerra'
# Without the "de" in the name

I would like the query on nome to ignore the words de, da, dos and das stored in the database.

How can I write this query, with LIKE or REGEXP, so that it ignores those words?

3 answers

2

I had a similar problem at my company. Basically, we use two techniques to identify homonyms that contain typos.

The first is the Levenshtein distance and the second is the soundex() function.

LEVENSHTEIN

Paraphrasing Wikipedia: "the Levenshtein distance, or edit distance, between two strings is the minimum number of operations needed to transform one string into the other."

For example, the Levenshtein distance between "Guilherme Silva" and "Guilherme da Silva" is 3. Between "Maria Dores" and "Maria das Dores" it is 4.

Here is the code to create the levenshtein() function in MySQL.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0;
                ELSE
                    SET cost = 1;
                END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    RETURN c;
END$$
DELIMITER ;

To use it in your case, you can run the following query:

SELECT * FROM usuarios AS us WHERE levenshtein(us.nome, 'Wallace Silva') < 5 -- Or another threshold. I used 5 so that it only matches names whose Levenshtein distance is at most 4.

If you have the following records in the database

--------------------
1 Wallace Silva
2 Wallace João da Silva
3 Wallace das Silva
4 Guilherme da Silva
--------------------

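The query will return, in this case, only records 1 and 3. Record 2 differs from the search string by the eight characters of "João da ", so its distance is above the threshold.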
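To pick a threshold by inspection rather than by guesswork, it can help to look at the distances themselves. The following is a minimal sketch, assuming the levenshtein() function above has already been created; the table name nomes_demo and the alias distance are hypothetical and only serve this illustration.

```sql
-- Scratch table (hypothetical name) holding the four sample records.
CREATE TEMPORARY TABLE nomes_demo (
    id   INT PRIMARY KEY,
    nome VARCHAR(255)
);

INSERT INTO nomes_demo (id, nome) VALUES
    (1, 'Wallace Silva'),
    (2, 'Wallace João da Silva'),
    (3, 'Wallace das Silva'),
    (4, 'Guilherme da Silva');

-- Show each record's distance to the search string, closest first,
-- so a cut-off can be chosen from real values.
SELECT id, nome, levenshtein(nome, 'Wallace Silva') AS distance
FROM nomes_demo
ORDER BY distance, id;
```

Sorting by the distance also gives a simple "best match first" ranking if the result is shown to a user.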

SOUNDEX

The soundex() function can be used in database searches where you know how a value is pronounced but not exactly how it is written.

soundex() is native to MySQL, so there is no need to create it manually.

In your case, the query would look something like this:

SELECT * FROM usuarios WHERE soundex(nome) = soundex('Wallace Silva')

The problem with soundex() is that it was designed for English words and is less reliable for Portuguese.
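Before relying on an equality test, it may be worth inspecting the codes soundex() actually produces. The sketch below only calls the built-in SOUNDEX(); the column aliases are hypothetical. Because the extra "d" in "da" is a consonant that contributes its own digit, the two codes should be expected to differ, which means strict equality on soundex() does not by itself ignore the preposition.

```sql
-- Compare the phonetic codes with and without the preposition.
SELECT SOUNDEX('Wallace Silva')    AS code_without_da,
       SOUNDEX('Wallace da Silva') AS code_with_da;
```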

COMBINED MODE

We can also write a query that combines the two kinds of search, to increase the chance of finding a match.

We can do it like this:

SELECT * FROM usuarios WHERE levenshtein(soundex(nome), soundex('Wallace Silva')) < 3

Or alternatively:

SELECT * FROM usuarios WHERE (levenshtein(nome, 'Wallace Silva') < 5) OR soundex(nome) = soundex('Wallace Silva')

SOME CONSIDERATIONS

  • The more functions we use, the more processing the machine has to do; a 'monster' query may take a long time to run
  • Both functions have drawbacks. I recommend studying them well and testing them thoroughly before using them in real code
  • So far, these have been the most effective ways I have found to search for similar names directly in the SQL query, and they have served me well. That doesn’t mean there aren’t better ones ;)
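On the processing cost mentioned in the first point, one possible mitigation is a cheap filter in front of the expensive call. The Levenshtein distance between two strings can never be smaller than the difference of their lengths, so rows whose length is too far from the search string can be excluded without changing the result. This is only a sketch, assuming the levenshtein() function above and the usuarios table from the question; whether MySQL actually skips the function call for excluded rows depends on how it evaluates the condition, so it should be measured.

```sql
-- The length check is a lower bound on the distance: any row it rejects
-- would also be rejected by levenshtein(...) < 5.
SELECT *
FROM usuarios
WHERE ABS(CHAR_LENGTH(nome) - CHAR_LENGTH('Wallace Silva')) < 5
  AND levenshtein(nome, 'Wallace Silva') < 5;
```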

Anyway, I hope I helped! Enjoy!

1

It may not be the most appropriate way, but, partly for performance reasons (wildcard searches can be costly), my first idea for this problem would be to store an auxiliary 'treated' field without the terms you want to eliminate. The search would be done on this field (after applying the same treatment to the search string).

Then we would have

Nome                NomeTratado
____                _____________
Ricardo de Melo     Ricardo Melo

Again, that would be my first idea. And I think there’s a good chance that there’s a better way.
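As a rough illustration of the auxiliary-field idea, the sketch below fills a NomeTratado column using only built-in string functions. It assumes the usuarios table from the question; the index name is hypothetical. Note two limitations: REPLACE() is case-sensitive, so "De" or "DA" would not be removed, and two identical words in a row (for example "de de") would leave one behind.

```sql
-- Auxiliary column, named after the example above.
ALTER TABLE usuarios ADD COLUMN NomeTratado VARCHAR(255);

-- Pad the name with spaces so only whole words match, remove each word,
-- then trim the padding again.
UPDATE usuarios
SET NomeTratado = TRIM(
    REPLACE(REPLACE(REPLACE(REPLACE(
        CONCAT(' ', nome, ' '),
        ' de ', ' '), ' da ', ' '), ' dos ', ' '), ' das ', ' ')
);

-- Hypothetical index name; on older row formats a prefix length may be needed.
CREATE INDEX idx_usuarios_nome_tratado ON usuarios (NomeTratado);

-- The search string receives the same treatment before it is compared.
SELECT * FROM usuarios WHERE NomeTratado = 'Wallace Souza Vizerra';
```

The column has to be kept in sync when nome changes, for example by repeating the same expression in the application code or in a trigger.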

Another approach would be to create an index on each part of the name. This can be useful for making the search return results when the person types only the last name.
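One way to read "an index on each part of the name" is a FULLTEXT index, which indexes the words of nome individually. This is a swapped-in interpretation, not necessarily what was meant above, and the index name is hypothetical. In boolean mode the + operator requires a word to be present, and words that are not listed, such as the prepositions, are simply not required. Word order is not enforced, and the minimum indexed word length depends on server settings.

```sql
-- Hypothetical index name; assumes a storage engine with FULLTEXT support.
ALTER TABLE usuarios ADD FULLTEXT INDEX ft_usuarios_nome (nome);

-- Every listed word must appear somewhere in nome, in any order.
SELECT nome
FROM usuarios
WHERE MATCH(nome) AGAINST('+Wallace +Souza +Vizerra' IN BOOLEAN MODE);

-- Searching by last name only.
SELECT nome
FROM usuarios
WHERE MATCH(nome) AGAINST('+Vizerra' IN BOOLEAN MODE);
```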

0

I didn’t fully understand the question, but you could use LIKE with a % after the last character typed.

Select nome From usuarios Where nome Like 'Wallace%'

Or you could replace each space in the search string with the "%" character, or, before building the SQL, replace these prepositions with %.

There are several ways to do this.
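The space-replacement idea could be sketched as follows, assuming the usuarios table from the question and a search string that contains no LIKE or regular-expression metacharacters. The first variant is loose: % matches anything, so a name with an unrelated word in between would also match. The second variant builds a REGEXP pattern that allows only de, da, dos or das between the typed words and keeps the word order.

```sql
-- Variant 1: each space becomes %, giving LIKE 'Wallace%Souza%Vizerra'.
SELECT nome
FROM usuarios
WHERE nome LIKE REPLACE('Wallace Souza Vizerra', ' ', '%');

-- Variant 2: each space becomes "( (de|da|dos|das))* ", i.e. zero or more
-- of the ignored words followed by a space. The resulting pattern is
-- ^Wallace( (de|da|dos|das))* Souza( (de|da|dos|das))* Vizerra$
SELECT nome
FROM usuarios
WHERE nome REGEXP CONCAT(
    '^',
    REPLACE('Wallace Souza Vizerra', ' ', '( (de|da|dos|das))* '),
    '$'
);
```

Neither variant can use a regular index on nome efficiently, so both are better suited to small tables.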

  • In my case, I need the word order in the query to match the database exactly, just disregarding the words de, da, dos and das in the database. So "Luiz Silva" in the query should return "Luiz da Silva" from the database. Understood?

  • And if the database had "Luiz da Silva" and also "Luiz da Silva Santos" and "Luiz da Silva dos Santos", should all 3 be returned?
