How to make a phonetic algorithm for Brazilian Portuguese?


122

To make searching for commonly misspelled information easier, we use phonetic algorithms. This feature is extremely useful but often neglected, especially outside the English language, which has several well-known algorithms, such as the ones mentioned on Wikipedia, particularly Soundex, available in various DBMSs and programming languages.

Our language has particularities that make it impossible to reuse algorithms from other languages. In fact, it is even regionalized: an algorithm that works for Brazil does not work for Portugal, and perhaps not for other Portuguese-speaking countries. I have doubts whether I would need specializations for the cearense, gaúcho, or even the piracicabano accent and its surroundings (hundreds of kilometers :)), just to name a few.

  • Is there any official source, a conclusive study, of what our phonetic algorithm should look like?
  • Based on these studies, or on experience, what would this algorithm look like? Can we take advantage of some variant of Soundex, Metaphone, or another established English-language algorithm, changing only the phonemes?
  • Can we, or should we, optionally handle common foreign phonemes as well, since we use foreign names?
  • I am interested in the algorithm itself, that is, in how to develop it in detail (general outlines are not enough; there is already plenty of that available); pseudo-code or real code in some language would be useful, but not fundamental.

Since I can port code from almost any mainstream language, and in fact I will (and others may) use it in several different languages, I don't care whether it's in C, C++, C#, Java, JavaScript, PHP, Perl, Python, Ruby, Lua, Delphi, Go, D, Scala, F#, or variants of BASIC, xBase (Clipper, Harbour, FoxPro), SQL, etc., or even COBOL :). Use what you already have or feel more comfortable with.

Note that what counts is the algorithm, not the implementation, so I don't care about the language. The same algorithm in a different language will be considered a duplicate.

Some references I know (some are pretty bad):

Even though I've accepted an answer, I think something better can still come up, and I'm willing to change the acceptance if that happens. I still want to see your answer.

  • The piracicabano algorithm must obligatorily turn on the loudspeaker and blast: "Pamonhas! Pamonhas! Pamonhas!" :)

  • I found an interesting study of a system for converting text into its phonetic transcription. It can be useful for thinking of an approach different from the traditional ones (Soundex/Metaphone). It is a doctoral thesis from UFSC, which is an important center for computational linguistics studies in Brazil. https://repositorio.ufsc.br/bitstream/handle/123456789/91849/254656.pdf?sequence=1

  • @bfavaretto I think so too. I was saving it for later, but I didn't want to miss the opportunity either. I like UFSC. It looks like a very interesting thesis.

  • I don't know much about the subject (and I don't dare to answer), but apparently there are some studies/projects related to Metaphone: http://link.springer.com/chapter/10.1007%2F978-3-642-28601-8_25 and http://sourceforge.net/projects/metaphoneptbr/

  • @Luizvieira very interesting. It is helping a lot.

  • @Bigown: I’m happy to help. :)

  • When I worked at Prodesp (I left in 2006), an algorithm like this was developed to register people in the system of the Court of Justice of the State of São Paulo. Unfortunately I couldn't find anything related to it openly on the internet, but this article gives a general idea of an algorithm that I consider similar to the one developed at Prodesp.

  • ... note that the algorithm I mentioned was actually optimized for people's names.

  • The number of distinct accents in Brazil is impressive. Surely the algorithm needs to be adapted for certain regions. Where I live, people speak practically a separate dialect. Good luck.


6 answers

72


The idea behind "Metaphone pt-BR" is precisely the success story of algorithms like Soundex during the first American censuses, and the later improvement of that idea with the emergence of Metaphone.

The niche of this algorithm is very specific. It is not an accurate phonetic representation following the IPA, but a word simplification based on "sounds like...". Like Soundex, it was used mainly for cross-referencing textual data and identifying duplicate, incorrectly spelled names. Its advantage is reducing the computational effort needed to find a word similar to another with algorithms like similar_text() (in PHP), Levenshtein, and others.

How can Metaphone be useful here? Algorithms such as similar_text, Levenshtein, and others return an index, usually between 0 and 1, of the degree of proximity between two strings, where 0 means no similarity and 1 means total similarity.

Now imagine cross-referencing a database of thousands of street names, each averaging three to four words, against a street name of about the same size from another database, which may be misspelled or abbreviated, in short, not exactly equal to what you have. You would need to check the similarity between each pair of words and take an average for the street. Considering that these similarity algorithms have complexities like O(m*n) and O(n³), we are facing a major computational effort to find a single street.

That is where Metaphone comes in. By simplifying the string, reducing it to at most 4 characters, it becomes possible to create an "index" for similar words. For example, REBECA, REBBECA, RABEKA, and many other variations all share the same Metaphone string: RBK. With this, I can apply the Levenshtein algorithm to a reduced set of words (usually between one and two dozen), reducing the computational effort required.
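
The bucketing idea can be sketched as follows. This is a toy illustration: simplify() is a made-up stand-in for the real Metaphone pt-BR rules, and the word list is invented; only the shape of the technique (phonetic key as index, Levenshtein only within a bucket) is the point.

```python
def simplify(word: str) -> str:
    """Toy phonetic key: NOT the real Metaphone pt-BR rules, just an
    illustration of reducing a word to a short consonant skeleton."""
    word = word.upper()
    # crude substitutions in the spirit of "sounds like"
    for a, b in (("PH", "F"), ("CH", "X"), ("C", "K"), ("Q", "K")):
        word = word.replace(a, b)
    # drop vowels and H, collapse adjacent repeats, keep at most 4 chars
    key = []
    for ch in word:
        if ch in "AEIOUH":
            continue
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)[:4]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(m*n)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Build the "index": bucket words by phonetic key (the extra column).
words = ["REBECA", "REBBECA", "RABEKA", "ROBERTO", "MARIA"]
index = {}
for w in words:
    index.setdefault(simplify(w), []).append(w)

# Search: Levenshtein runs only inside the small bucket, not over everything.
query = "REBEKA"
bucket = index.get(simplify(query), [])
best = min(bucket, key=lambda w: levenshtein(query, w))
```

In a database, the key produced by simplify() would be the extra indexed column; a query first narrows the candidates by key and only then pays the O(m*n) Levenshtein cost on the handful of words in the bucket.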

Of course there are other approaches to the problem, but Metaphone is one of them, and it only requires one extra column in the database to serve as an index.

The Portuguese version was created because the original Metaphone uses English spelling and pronunciation rules, so many Portuguese words end up falling into different groups, because they would sound different to an American.

The article that @Luiz-Vieira cites in the comments to the question is my own, motivated by my participation in the REDECA project, where the challenge was to set up a registry of people who could hold several documents, none of them mandatory. Those who develop systems know that the CPF, being unique and numeric, is commonly used as a primary key to avoid duplicate registrations. Phonetic matching was one of the approaches that came up in the group. I then refined the algorithm using a database of people's names, 1 million names with 220,000 unique words, resulting in the study mentioned above, where I concluded that 4 characters are sufficient and established the rules for a Brazilian Portuguese Metaphone. The implementation of these rules is in the C code available on SourceForge and in the JavaScript by @João-Paraná.

I believe that, as of the REDECA project, this was one of the first open Metaphone variants for Brazilian Portuguese, and it yielded a scientific article.

And to add an answer to the question, here is the link to the README of the C code, which contains the conversion rules of Metaphone pt-BR as published in the article: http://sourceforge.net/p/metaphoneptbr/code/ci/master/tree/README

28

I started making one that works with the following algorithm:

  1. Have available to the system a database with as many Portuguese words as possible, which can be obtained from dictionaries.

  2. "Normalize" the user's string. In my case, I removed repeated words, lowercased everything, and removed articles, punctuation characters, prepositions, conjunctions, adverbs, etc. The goal is to make the string as small as possible.

  3. Check which words are in the dictionaries from step 1. These will be kept in the string.

  4. Take the words in the string that are not in the dictionary and compare them with those that are. This comparison should return a percentage, and you set a threshold above which two words are considered "equal". E.g.: carroça and carrosa. A simple count and order check of the letters shows that they are 85.714% similar, so they are "the same word". Of course there are better algorithms, already implemented in some languages (e.g., the similar_text() function in PHP).
     These "wrong" words are then replaced in the original string by their dictionary counterparts.

  5. Any word in the string that was not identified in steps 3 or 4 is removed.

  6. We then have a clean string, with most mistakes fixed. It is saved in a table that has a single FULLTEXT index storing these "clean strings" (the content of a blog post, for example) and a foreign key referencing, in this case, the blog post in another table.

  7. The "almost phonetic search" is performed by applying the previous steps to the string sent by the user and searching the FULLTEXT column.

I use a similar algorithm in a search system, and it has been working so far. Of course, searches like Soundex could fill a doctoral thesis; it is a complex subject, and still little explored with regard to the regionalisms and peculiarities of the Portuguese language. The algorithm above never went through performance optimization, such as caching, but I intend to do that.
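
A minimal sketch of steps 2 to 5, assuming a toy dictionary and stop-word list (both invented for this example) and using Python's difflib.SequenceMatcher as a stand-in for PHP's similar_text(); the 80% threshold is likewise an arbitrary choice:

```python
import re
from difflib import SequenceMatcher

# Hypothetical, tiny stand-ins for the real dictionary and stop-word list.
DICTIONARY = {"carroça", "quebrada", "roda"}
STOPWORDS = {"a", "o", "de", "da", "do", "com", "uma"}

def similarity(a: str, b: str) -> float:
    """Proximity between 0 and 1, akin to PHP's similar_text()."""
    return SequenceMatcher(None, a, b).ratio()

def clean_string(text: str, threshold: float = 0.8) -> str:
    # Step 2: lowercase, strip punctuation, drop stop words and repeats.
    words = re.findall(r"\w+", text.lower())
    seen, kept = set(), []
    for w in words:
        if w in STOPWORDS or w in seen:
            continue
        seen.add(w)
        # Step 3: keep dictionary words as-is.
        if w in DICTIONARY:
            kept.append(w)
            continue
        # Step 4: replace near-misses by their closest dictionary entry.
        best = max(DICTIONARY, key=lambda d: similarity(w, d))
        if similarity(w, best) >= threshold:
            kept.append(best)
        # Step 5: anything below the threshold is silently dropped.
    return " ".join(kept)
```

For instance, clean_string("A carrosa quebrada!") drops the article, corrects "carrosa" to "carroça" (about 85.7% similar, above the threshold), and keeps "quebrada"; the resulting clean string is what would be stored in the FULLTEXT column.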

28

Names vs words of our vocabulary

Given names are present not only in registries of natural persons, but also in place names, street names, etc. It is a universe of its own, immense, and the most common challenge in database normalization and record-linkage tasks.

The general-purpose words of a text, i.e., the vocabulary of our language, on the other hand, can be treated separately from first names.

The statistical results of analyzing vocabulary words differ from those of analyzing first names... Proper names tend to be "multilingual" (different pronunciations and origins), while vocabulary, even with its foreign borrowings, looks more like your own language.

Strictly speaking, we would then have three types of phonetic algorithms: optimized for proper names, optimized for vocabulary, and generic. In practice most try to be generic, but as Jordão well pointed out, the most common application is grouping similar first names.

Metaphone-standards

I also had contact with Lawrence Philips, who created Metaphone in 1990, Double Metaphone in 2000, and Metaphone 3 in 2009. We exchanged long emails when I started to draft the project,

https://code.google.com/p/metaphone-standards/

My concern was the interoperability of Metaphone-like indexes in databases and XML; if each of us creates our own little algorithmic variation, for our own language or our own optimizations (names vs. words), the indexes become incompatible. Standardization is always needed; this was an attempt, and it is still open, if anyone wants to help resume the project.

In the email exchange I had with Philips, between 2011 and 2012, we discussed the "evolutionary history" of phonetic hashing and the existence of standards... In the middle of it, Philips expressed that he considered "himself the standard", and that he did not think it fair for me to propose new "standards", drawing attention away from his company, amorphics.com...

Since Metaphone is just one of them (@Jcködel well remembered others here, like SOUNDEX-en), and I was getting help from Philips, I agreed to change the name of the project to something like "Metaphone Recipes" or "Phonetic Hashing Recipes" (I prefer the latter)... I even started, but ran out of time.

Examples

Formal specifications (without optimized implementations) of some Metaphone variations, designated by language (en = English, pt = Portuguese, pt-BR = Brazilian Portuguese, es = Spanish) and version (1, 1.1, 1.2, ...):

  • Metaphone-en version 1: from 1990, by Philips.

  • Metaphone-en version 1.1: a small improvement, based on the suggestions of Kuhn (1995). The reengineering (to obtain a formal specification) was based on Battley's Ruby code.

  • Metaphone-en version 1.2.1: this is Philips' Double Metaphone, which I started to "translate" into pseudo-code, but I didn't have time to finish (anyone willing to help?).

  • Metaphone-pt version 1: from ~2008, an algorithm by the staff of the Municipality of Várzea Paulista. According to the link, they based it on a "Metaphone for Spanish".

  • Metaphone pt-BR version 1.2: developed by Jordão in C, which is still the fastest, safest, and most used in Brazil.

  • Metaphone-es version 1: ... I have stashed away an algorithm (cited by the Várzea Paulista staff) by I. J. Sustaita, from 2005... If someone says it would help, I will dig it up.

  • Metaphone-es version 2: ... much younger, from 2011, by A. Mosquera, who claims to have used Metaphone-en version 2 as a direct basis, not Metaphone-es version 1... We need to validate it and properly register it in the project, reengineering his Python code into pseudo-code.

Answer

To the main question (How to make a phonetic algorithm for Brazilian Portuguese?): "making an algorithm" can mean 1) a "high-level algorithm" (formal specification), or 2) an "implementation algorithm".

  • Answer 1: create your own MLAV (Metaphone-Language Algorithm Version) following the recommended rules, so that others can enjoy, discuss, or use it. As @Maniero commented, we can assume that "an algorithm that works for Brazil does not work for Portugal"; that is, if someone makes the effort, they can demonstrate that there is a MLAV "phonetically optimized" for pt-PT, and create the standard (MLAV) Metaphone-pt-PT version 1 in the metaphone-standards project.

  • Answer 2: take any MLAV, for example Metaphoneptbr v1.2, and implement it. @Joãoparaná, for example, made his implementation based on Metaphoneptbr v1 in JavaScript without using regular expressions, whereas you could make another implementation using regular expressions.

To the first sub-question (Is there any official source, a conclusive study of what our phonetic algorithm should look like?): there is the project I cited, Metaphone-standards, which set out to "formalize" the various variants; as for a "conclusive study", there is Jordão's article, justifying the choice of what I named Metaphone pt-BR version 1.2.

To the second sub-question (Based on these studies or on experience, what would this algorithm look like?): I think everyone here has already given their opinion; just evaluate them, including by votes.

To the third sub-question (Can we or should we use common foreign phonemes as well, since we use foreign names?): see my section above on "Names vs words of our vocabulary". Personally, I think that if you are going to index "generic text" you need a generic algorithm that covers foreign names and first names.

To the fourth sub-question (I am interested in the algorithm itself... a pseudo-code...): that is the purpose of the Metaphone-standards project cited above; there you can find the pseudo-code for the desired language and version, or submit a new proposal that does not yet exist.

22

The Metaphone already mentioned above by @Luiz-Vieira is able to generate phonetically similar strings from input strings. The C source code can be seen at this link.

See below the text extracted from the project's README.

Metaphone for Brazilian Portuguese

Metaphone is a rule-based algorithm for the phonetic transformation of text (en.wikipedia.com/wiki/Metaphone). The rules were based on joint work published by Várzea Paulista (www2.varzeapaulista.sp.gov.br/metaphone) during the REDECA project, which focused on childhood and adolescence. This port is a variation for Portuguese, at least the Brazilian one.

I recently had contact with Carlos Jordão, the author of the implementation, and suggested using regular expressions in JavaScript to get a solution that would work both in the browser and server-side with Node.js. He found a later implementation, which uses switch/case to drive a state machine, more appropriate, since that is how the C implementation available in the Git repository of the project already mentioned works.

He did the initial port and wrapped it in an anonymous function, with a page for unit testing; it can now be turned into a jQuery plugin or a module for YUI 3 or Node.

By the way, Jordão's C version can be used in the PostgreSQL DBMS, in PHP 5, and on Debian/Ubuntu, and can be ported to other environments.

See the GIST with the JavaScript implementation of the getMeta() function of the YUI 3 Y.Metaphone module.

The test page is here at this link

12

Considering only the implementation of the algorithm, I have a function adapted to MySQL in which the phonemes have been analyzed and improved over time to reflect the searches performed in some systems. It may contain some inconsistencies (as in the case of "W", which has no defined rule), but it solves the vast majority of cases. The implementation details are commented throughout the code:

DROP FUNCTION IF EXISTS transformar_fonetica;

DELIMITER $
CREATE FUNCTION transformar_fonetica(ptexto TEXT)
RETURNS TEXT
BEGIN
  DECLARE vtexto             TEXT;
  DECLARE vtexto_apoio       TEXT;
  DECLARE vposicao_atual     INT;
  DECLARE vcaracter_anterior VARCHAR(1);
  DECLARE vcaracter_atual    VARCHAR(1);
  DECLARE vcaracter_seguinte VARCHAR(1);
  DECLARE vsom               VARCHAR(2);
  DECLARE com_acentos        VARCHAR(65);
  DECLARE sem_acentos        VARCHAR(65);

  SET vtexto = UPPER(ptexto);

  SET com_acentos = 'ŠšŽžÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸÞàáâãäåæèéêëìíîïñòóôõöøùúûüýÿþƒ';
  SET sem_acentos = 'SsZzAAAAAAAEEEEIIIINOOOOOOUUUUYYBaaaaaaaeeeeiiiinoooooouuuuyybf';
  SET vposicao_atual = CHAR_LENGTH(com_acentos);

  -- Remove accents (CHAR_LENGTH counts characters; LENGTH would count bytes)
  WHILE vposicao_atual > 0 DO
    SET vtexto = REPLACE(vtexto, SUBSTRING(com_acentos, vposicao_atual, 1), SUBSTRING(sem_acentos, vposicao_atual, 1));
    SET vposicao_atual = vposicao_atual - 1;
  END WHILE;

  -- Remove invalid characters
  SET vposicao_atual = 1;

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF INSTR('ABCÇDEFGHIJKLMNOPQRSTUVWXYZ ', vcaracter_atual) <> 0 THEN
      SET vtexto_apoio = CONCAT(IFNULL(vtexto_apoio, ''), vcaracter_atual);
    END IF;

    SET vposicao_atual = vposicao_atual + 1;
  END WHILE;

  SET vtexto = vtexto_apoio;

  -- Replace the simpler patterns first
  SET vtexto = REPLACE(vtexto, 'Ç', 'S');
  SET vtexto = REPLACE(vtexto, 'SH', 'X');
  SET vtexto = REPLACE(vtexto, 'XC', 'S');
  SET vtexto = REPLACE(vtexto, 'QU', 'K');
  SET vtexto = REPLACE(vtexto, 'CH', 'X');
  SET vtexto = REPLACE(vtexto, 'PH', 'F');
  SET vtexto = REPLACE(vtexto, 'LH', 'LI');
  SET vtexto = REPLACE(vtexto, 'NH', 'NI');

  -- Remove duplicates, except "S", which changes the sound of the syllable
  SET vposicao_atual = 1;
  SET vtexto_apoio = '';

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF vposicao_atual < LENGTH(vtexto) THEN
      SET vcaracter_seguinte = SUBSTRING(vtexto, vposicao_atual + 1, 1);
    ELSE -- No need to check the last character
      SET vcaracter_seguinte = '';
    END IF;

    IF vcaracter_atual <> vcaracter_seguinte OR vcaracter_atual = 'S' THEN -- keep "SS"; drop other duplicates
      SET vtexto_apoio = CONCAT(vtexto_apoio, vcaracter_atual);
    END IF;

    SET vposicao_atual = vposicao_atual + 1;
  END WHILE;

  SET vtexto = vtexto_apoio;

  -- Replace characters by their sound
  SET vposicao_atual = 1;
  SET vtexto_apoio = '';

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF vposicao_atual < LENGTH(vtexto) THEN
      SET vcaracter_seguinte = SUBSTRING(vtexto, vposicao_atual + 1, 1);
    ELSE
      SET vcaracter_seguinte = '';
    END IF;

    -- "B" followed by any character other than "A", "E", "I", "O", "U", "R" or "Y"
    IF vcaracter_atual = 'B' AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
        SET vsom = 'BI';
    -- "C" followed by "E", "I" or "Y"
    ELSEIF vcaracter_atual = 'C' AND INSTR('EIY', vcaracter_seguinte) <> 0 THEN
      SET vsom = 'S';
    ELSEIF vcaracter_atual = 'C' THEN
      SET vsom = 'K';
    ELSEIF vcaracter_atual = 'D'  AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
      SET vsom = 'DI';
    ELSEIF vcaracter_atual = 'G' AND INSTR('EIY', vcaracter_seguinte) <> 0  THEN -- GE, GI or GY
      SET vsom = 'J';
    ELSEIF vcaracter_atual = 'G' AND vcaracter_seguinte = 'T' THEN -- GT
      SET vsom = '';
    ELSEIF vcaracter_atual = 'H' THEN -- "H" is the only letter in our alphabet with no phonetic value, i.e., no sound.
      SET vsom = '';
    ELSEIF vcaracter_atual = 'N' AND INSTR('AEIOUY', vcaracter_seguinte) = 0 THEN -- When followed by a consonant, it takes the closed "M" sound
      SET vsom = 'M';
    ELSEIF vcaracter_atual = 'P' AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
      SET vsom = 'PI';
    ELSEIF vcaracter_atual = 'Q' THEN
      SET vsom = 'K';
    -- QUA, QUE, QUI, QUO or QUY
    ELSEIF IFNULL(vcaracter_anterior, '') = 'Q' AND vcaracter_atual = 'U' AND INSTR('AEIOY', vcaracter_seguinte) <> 0 THEN
      SET vsom = '';
    -- Between two vowels, "S" always has the voiced "Z" sound. Examples: coisa, faisão, mausoléu, lousa, Neusa, Brasil, Sousa, cheiroso, manhoso, gasoso, etc.
    ELSEIF (IFNULL(vcaracter_anterior, '') <> '' AND INSTR('AEIOUY', IFNULL(vcaracter_anterior, '')) <> 0) AND vcaracter_atual = 'S' AND INSTR('AEIOUY', vcaracter_seguinte) <> 0 THEN
      SET vsom = 'Z';
    ELSEIF vcaracter_atual = 'S' AND vcaracter_seguinte = 'C' THEN -- "S" followed by "C" has no sound
      SET vsom = '';
    ELSEIF vcaracter_atual = 'W' THEN -- "W" has no defined rule; it can sound like "V" or "U" depending on the word
      SET vsom = 'V';
    ELSEIF vcaracter_atual = 'X' AND INSTR('AEIOUY', vcaracter_seguinte) <> 0 THEN -- "X" followed by a vowel. Example: exemplo
      SET vsom = 'Z';
    ELSEIF vcaracter_atual = 'X' AND INSTR('AEIOUY', vcaracter_seguinte) = 0 THEN -- "X" followed by a consonant. Example: exceção
      SET vsom = 'S';
    ELSEIF vcaracter_atual = 'Y' THEN
      SET vsom = 'I';
    ELSEIF vcaracter_atual = 'Z' AND INSTR('AEIOUY', vcaracter_seguinte) = 0 THEN
      SET vsom = 'S';
    ELSE
      SET vsom = vcaracter_atual;
    END IF;

    SET vcaracter_anterior = vcaracter_atual;
    SET vposicao_atual = vposicao_atual + 1;
    SET vtexto_apoio = CONCAT(vtexto_apoio, vsom);
  END WHILE;

  SET vtexto_apoio = REPLACE(vtexto_apoio, 'SS', 'S'); -- Remove the "SS" that was kept to decide whether it stayed "S" or became "Z"
  SET vtexto = vtexto_apoio;

  RETURN vtexto;
END
$

Some sample outputs:

NEUSA   | NEUZA
HERESIA | EREZIA
PALHA   | PALIA
QUERO   | KERO
EXIGIR  | EZIJIR
HESITAR | EZITAR

Some with misspellings:

ÇAPO      | SAPO
EREZIA    | EREZIA
ESITAR    | EZITAR
LAGOZTA   | LAGOSTA
HORIENTAR | ORIEMTAR

EDIT

Revising the rules, I decided to write a similar algorithm in Java. I noticed that "X" also has no defined rule, so only some of its cases can be handled. I also added handling for the tilde (~). The result was as follows:

import java.text.Normalizer;
import java.util.LinkedHashSet;

public class Fonetica {

  public String converterFrase(String frase) {
    LinkedHashSet<String> palavras;

    palavras = this.converter(frase.split(" "));

    return String.join(" ", palavras);
  }

  public LinkedHashSet<String> converter(String... palavras) {
    LinkedHashSet<String> resultado = new LinkedHashSet<>();

    for (String palavra : palavras) {
      resultado.add(this.converter(palavra));
    }

    return resultado;
  }

  public String converter(String palavra) {
    palavra = palavra.toUpperCase();

    palavra = palavra.replace("Ç", "SS");
    palavra = palavra.replace("Y", "I");
    palavra = palavra.replace("W", "V"); // "W" has no defined rule; sometimes it is "V", sometimes "U"
    palavra = palavra.replace("GT", "");
    palavra = palavra.replace("Q", "K");
    palavra = palavra.replace("SH", "X");
    palavra = palavra.replace("CH", "X");
    palavra = palavra.replace("PH", "F");
    palavra = palavra.replace("LH", "LI");
    palavra = palavra.replace("NH", "NI");
    palavra = palavra.replace("H", ""); // "H" is the only letter in our alphabet with no phonetic value.

    palavra = this.removerDuplicadas(palavra);

    // Accented characters
    palavra = palavra.replaceAll("([ÃÕ])([EO])", "$1-$2"); // Split the syllables
    palavra = palavra.replaceAll("([ÃÕ])", "$1M");
    palavra = this.removerAcentos(palavra);

    palavra = palavra.replaceAll("([BDP])([^AEIOU]|$)", "$1I$2"); // Silent "B", "D" and "P"
    palavra = palavra.replaceAll("C([AOUR])", "K$1"); // "CA", "CO" and "CU" become "KA", "KO" and "KU" respectively
    palavra = palavra.replaceAll("C([EI])", "SS$1"); // "CE" and "CI" become "SSE" and "SSI" respectively
    palavra = palavra.replaceAll("C([^AEIOU]|$)", "KI$1"); // Silent "C" sounds like "KI"
    palavra = palavra.replaceAll("G([EI])", "J$1"); // "GE" and "GI" sound like "JE" and "JI" respectively
    palavra = palavra.replaceAll("L([^AEIOU]|$)", "U$1"); // When "L" comes before a consonant
    palavra = palavra.replaceAll("N([^AEIOU]|$)", "M$1"); // When "N" is followed by a consonant, it takes the closed "M" sound
    palavra = palavra.replaceAll("X([^AEIOU]|$)", "SS$1"); // When "X" is followed by a consonant, it sounds like "SS"
    palavra = palavra.replaceAll("([AEIOU])S([AEIOU])", "$1Z$2"); // "S" between vowels sounds like "Z"
    palavra = palavra.replaceAll("Z([^AEIOU]|$)", "SS$1"); // "Z" followed by a consonant (or at the end) sounds like "S"

    palavra = palavra.replaceAll("S+", "S"); // More than one "S" becomes a single "S"
    palavra = palavra.replace("OU", "O"); // When "U" follows "O" it has no sound

    return palavra;
  }

  private String removerDuplicadas(String texto) {
    String[] letras = "ABCDEFGHIJKLMNOPQRTUVWYXZ".split(""); // "S" is intentionally absent: "SS" is kept to decide between "S" and "Z"

    for (String letra : letras) {
      texto = texto.replaceAll(letra + "+", letra);
    }

    return texto;
  }

  private String removerAcentos(String texto) {
    texto = Normalizer.normalize(texto, Normalizer.Form.NFD);
    texto = texto.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");

    return texto;
  }
}

Working example on IDEONE

7

One way to perform phonetic search (the same one used by SQL Server's SOUNDEX function) is to assign numbers to the phonetic groups of the word.

E.g., in SQL Server (which has phonetic search for English):

SELECT 'BROWN', SOUNDEX('BROWN')
UNION
SELECT 'BRAWN', SOUNDEX('BRAWN')

Both result in B650.
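
For reference, the classic letter-to-digit scheme behind that result can be sketched in a few lines. This uses the standard English Soundex table and is a simplified sketch of the idea, not a reimplementation of SQL Server's internals or a Portuguese adaptation:

```python
# Digit groups of the classic (English) Soundex algorithm.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(word: str) -> str:
    word = word.upper()
    first = word[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        # Vowels reset the previous code; adjacent identical codes collapse.
        if code and code != prev:
            digits.append(code)
        prev = code
    # First letter plus up to three digits, zero-padded.
    return (first + "".join(digits) + "000")[:4]
```

Here soundex("BROWN") and soundex("BRAWN") both yield B650, matching the SQL example above; the Portuguese adaptations linked below (such as BuscaBR) follow the same assign-a-digit-per-phonetic-group idea with groups chosen for Portuguese.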

There are several articles and even some code attempting to implement this for Portuguese:

  • http://www.scribd.com/doc/38615737/BuscaBR-Fonetica
  • http://www.linhadecodigo.com.br/artigo/2237/implementando-algoritmo-buscabr.aspx
  • http://www.brunoportfolio.com/arquivos/pdf/BuscaBR_Fonetica.pdf
  • http://www.macoratti.net/sql_sdex.htm

Code: http://www.devmedia.com.br/forum/soundex-em-portugues/274192

  • Aren't these links the same ones as in the question?
