Adding some nuances to @gmsantos's answer...
Metaphone for Portuguese names
This question has already discussed phonetic algorithms for Portuguese at length. For this problem they are more efficient than mathematical similarity or distance measures such as Hamming, Levenshtein and others, which measure the similarity between arbitrary strings (they are even used in genetics).
The question points toward a more practical and already classic problem: grouping or matching proper names (street names, people's names, etc.). For example, "João"="Joao", "Sylvia"="Silvia", "Luíz"="luis", etc.
The experience of those who have worked on this (documented in this article) shows that the most frequent errors are the spelling mistakes we make when we simply transcribe what we hear. Hence the focus on phonetics.
And the phonetics of Portuguese speakers is not the phonetics of English speakers... Thus the best solution is the best phonetic algorithm adapted to Portuguese... And this exists!
It is Metaphoneptbr.
(If you are not allowed to install external functions on your server, generic Metaphone is still better than Soundex.)
In PostgreSQL (8.x or 9.x), once it is installed, just run:
SELECT metaphone_ptbr('Sylvia')=metaphone_ptbr('sillvya');
-- returns TRUE ('SV'=='SV')
SELECT metaphone_ptbr('Sylveira')=metaphone_ptbr('sillvya');
-- returns FALSE ('SVR'!='SV')
The great advantage of this method is that the comparison can be "cached", i.e., part of the work (the metaphone of every name) can be computed in advance and stored in the database, so that searching for a given name, or grouping similar names, is much faster than a pairwise evaluation with string similarity functions.
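A minimal sketch of this caching idea, assuming metaphone_ptbr is already installed. The table `people`, its columns and the index name are hypothetical and chosen only for illustration; they do not come from the answer above.

```sql
-- Hypothetical table; only metaphone_ptbr comes from the text above.
CREATE TABLE people (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

-- Store the phonetic key once instead of recomputing it on every search.
ALTER TABLE people ADD COLUMN name_key text;
UPDATE people SET name_key = metaphone_ptbr(name);

-- An ordinary B-tree index on the stored key makes equality lookups fast.
CREATE INDEX people_name_key_idx ON people (name_key);

-- Search: only the search term is converted at query time.
SELECT id, name
FROM people
WHERE name_key = metaphone_ptbr('sillvya');
```

In this sketch the key is filled in by a one-off UPDATE, so rows inserted or renamed later would need their `name_key` refreshed as well, for example by the application or by a trigger.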
The grouping also narrows the work down: in a database with 1000 names, for example, the analysis can be reduced to a group of 10 or 20 names, and only on those do you apply the more sophisticated (and more CPU-intensive) string similarity functions.
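This two-stage filtering could look roughly like the query below. It assumes a hypothetical table `people(name, name_key)` in which `name_key` already holds `metaphone_ptbr(name)`, and it assumes the fuzzystrmatch contrib module is installed so that `levenshtein()` is available; neither assumption comes from the answer itself.

```sql
-- Assumptions: people.name_key = metaphone_ptbr(people.name) (hypothetical
-- table), and levenshtein() provided by the fuzzystrmatch contrib module.
SELECT name,
       levenshtein(lower(name), lower('sillvya')) AS distance
FROM people
WHERE name_key = metaphone_ptbr('sillvya')  -- cheap phonetic filter first
ORDER BY distance                           -- costly comparison only on survivors
LIMIT 10;
```

The edit distance is computed only for rows that pass the WHERE clause, so the expensive function runs on the small phonetic group rather than on the whole table.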
Do you want to implement this in a specific database?
– gmsantos
@gmsantos I usually use Postgres, but if there is a universal solution, better.
– user7261