How to do similar word searches in SQL?

Asked

Viewed 6,772 times

8

Let’s assume I have the following data in a table:

John

Peter

Ronaldo

Luiz

If I use a query %like% it finds if the user type strictly. For example: if he type ron he thinks Ronaldo. But if he types a different letter, ex: ronldo (typo) he doesn’t think. But Google (and others) often find these words similar.

Is there any way to do this in SQL, database independent?

  • You want to implement this in some specific bank?

  • @gmsantos I usually use Postgres, but if there is a universal solution, better.

2 answers

9


In some databases it is possible to do this type of search that is called Fuzzy Matching.

You can get this result through the functions soundex() and difference(). I know they exist in SQL Server and from a quick search I discovered that she is present in the PostgreSQL through the module fuzzystrmatch.

The function soundex() returns a 4-character code from a passed string and the function difference() compares the difference between these codes, at a level of 0 to 4, where 0 is the exact match of the code:

CREATE TABLE s (nm text);

INSERT INTO s VALUES ('john');
INSERT INTO s VALUES ('joan');
INSERT INTO s VALUES ('wobbly');
INSERT INTO s VALUES ('jack');
INSERT INTO s VALUES ('ronaldo');

SELECT soundex('john') AS john, soundex('joan') AS joan, soundex('jack') AS jack; 
-- Retorna 'J500', 'J500' e 'J200'

SELECT * FROM s WHERE soundex(nm) = soundex('john');
-- Casa com 'jonh' e 'joan'    

SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
-- Casa com 'jonh', 'joan' e 'jack'

SELECT * FROM s WHERE soundex(nm) = soundex('ronldo');
-- Casa com 'ronaldo'

Example in Sqlfiddle

In some simple cases, soundex() can solve your problem. According to the Postgresql documentation, this function has better results with English names.

For more complex implementations, there are other similarity comparison methods, such as metaphone() or levenshtein() (See examples in the documentation of fuzzystrmatch).

In the most complex cases I recommend applying an index of the type Full-Text Index for better performance in large databases.

See the documentation of the functions soundex() and difference() in SQL Server.

  • 2

    Interesting subject, and vlw by the answer!

3

Adding some nunces to the @gmsantos response...

Metaphone for Portuguese names

In this question has been widely discussed the phonetic algorithm for Portuguese, which is more efficient than the mathematical similarity of difference or distances such as Hamming, Levenshtein and others, which measure the similarity between any strings (even in genetics they use).

The question goes in the direction of a more practical and already classic problem: grouping or matching given names (street names, people names, etc.). For example, "João"="Joao", "Sylvia"="Silvia", "Luíz"="luis", etc.

The experience of those who have worked (documented in this article) shows that the most frequent spelling errors are due to the spelling mistakes we make when we try to just transcribe what we hear. That’s why the focus on phonetics.

And the phonetics of Portuguese speakers is not the phonetics of English speakers... Thus the best solution is the best phonetic algorithm adapted to Portuguese... And this exists!

This is about the Metaphoneptbr.
(if you do not have access to install external functions on your server, Metaphone generic is also higher than Soundex).

In Postgresql (8.X or 9.X), after installed just do

SELECT metaphone_ptbr('Sylvia')=metaphone_ptbr('sillvya');
-- retorna TRUE  ('SV'=='SV')
SELECT metaphone_ptbr('Sylveira')=metaphone_ptbr('sillvya');
-- retona FALSE ('SVR'!='SV')

The great advantage of this method is that the comparison can be "cached", ie, part of the process can be stored before in the database (the metaphone of all names), so that the search for a given name, or the grouping of similar, is much faster than peer-to-peer evaluation by string similarity functions.

As the grouping allows, a database with 1000 names for example, one can reduce the analysis to a group of 10 or 20 names, and on them apply the most sophisticated functions (cost more CPU) of string similarity.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.