Smart Query with Mysql

Asked

Viewed 1,340 times

9

I’m trying to make a "smarter" query on my DB, my doubt is as?

I have a query:

SELECT * from publicacao where titulo like '%$busca%';

and in my DB has several publishing titles, for example:

Farmácia
Pharmácia
Pharmacia
Farmacia

When searching by pharmacy, the 4 lines should be returned.

It is possible to do this?

  • If you give a like on cia it will bring, but will bring more things too... Have little information to give a better answer (at least for me :) )

  • then the word sought could be one of the 4 and always return the 4 lines or all q had pharmacy in one of these variations

  • Search for FTS (full text search). The accent has no problem, Mysql does natively if you use an international collation (either in UTF-8, or Latin/ISO, works great, just need to choose one compatible with encoding). In the case of PH already changes a little.

  • I get it, I’ve worked in a place where we called this phonetic search, they registered one of the syllables and there was an engineering to locate based on that, but it’s not very simple

  • @Ricardo in Portuguese is very boring, because there are combinations that change all the meaning depending on the division of the words. I find it more efficient to do a similar letter approximation instead of an exact search, but it has to be analyzed on a case-by-case basis. In my systems already adopted search for substring, so if one needs Luiz/Luis Augusto, she researches Lui Augusto and ready :) - For more complicated thing, I use a variant of Levenshtein

  • @Bacco, I agree 100% with you... It is not the best solution (I’m even reading about the FTS here), but it is a solution that worked and used there, as I’ve seen working I decided to talk.

  • @Ricardo I think it’s nice that you spoke. I only mentioned because I’ve seen enough trying this way and getting complicated. I don’t even think it’s impossible no, I just didn’t see anything ready optimized for Portuguese, because people end up trying to reuse things made for English, and then it doesn’t work well. If someone stops to write an algorithm thinking about our particularities, even able to be cool. You just have to be careful with things that give more of a phonetic combination.

  • 3

    If you want something ready, you have this solution with Levenshtein here http://stackoverflow.com/a/4671557/916193

  • thank you very much, I will try to use I think q should solve my problem. I now have a question. How much this weighs the script?

Show 4 more comments

2 answers

8


If you want to search for the phonetics of the words adapted to Portuguese perform the function:

DROP FUNCTION IF EXISTS transformar_fonetica;

DELIMITER $
CREATE FUNCTION transformar_fonetica(ptexto TEXT)
RETURNS TEXT
BEGIN
  DECLARE vtexto             TEXT;
  DECLARE vtexto_apoio       TEXT;
  DECLARE vposicao_atual     INT;
  DECLARE vcaracter_anterior VARCHAR(1);
  DECLARE vcaracter_atual    VARCHAR(1);
  DECLARE vcaracter_seguinte VARCHAR(1);
  DECLARE vsom               VARCHAR(2);
  DECLARE com_acentos        VARCHAR(65);
  DECLARE sem_acentos        VARCHAR(65);

  SET vtexto = UPPER(ptexto);

  SET com_acentos = 'ŠšŽžÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸÞàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿþƒ';
  SET sem_acentos = 'SsZzAAAAAAACEEEEIIIINOOOOOOUUUUYYBaaaaaaaceeeeiiiinoooooouuuuyybf';
  SET vposicao_atual = LENGTH(com_acentos);

  -- Remove acentos
  WHILE vposicao_atual > 0 DO
    SET vtexto = REPLACE(vtexto, SUBSTRING(com_acentos, vposicao_atual, 1), SUBSTRING(sem_acentos, vposicao_atual, 1));
    SET vposicao_atual = vposicao_atual - 1;
  end while;

  -- Remove caracteres inválido
  SET vposicao_atual = 1;

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF INSTR('ABCDEFGHIJKLMNOPQRSTUVWXYZ ', vcaracter_atual) <> 0 THEN
      SET vtexto_apoio = CONCAT(IFNULL(vtexto_apoio, ''), vcaracter_atual);
    END IF;

    SET vposicao_atual = vposicao_atual + 1;
  END WHILE;

  SET vtexto = vtexto_apoio;

  -- Substitui os mais simples
  SET vtexto = REPLACE(vtexto, 'SS', 'S');
  SET vtexto = REPLACE(vtexto, 'SH', 'X');
  SET vtexto = REPLACE(vtexto, 'XC', 'S');
  SET vtexto = REPLACE(vtexto, 'QU', 'K');
  SET vtexto = REPLACE(vtexto, 'CH', 'X');
  SET vtexto = REPLACE(vtexto, 'PH', 'F');
  SET vtexto = REPLACE(vtexto, 'LH', 'LI');

  -- Remove duplicados
  SET vposicao_atual = 1;
  SET vtexto_apoio = '';

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF vposicao_atual < LENGTH(vtexto) THEN
      SET vcaracter_seguinte = SUBSTRING(vtexto, vposicao_atual + 1, 1);
    ELSE -- Último caracter não tem motivo para ser verificado
      SET vcaracter_seguinte = '';
    END IF;

    IF vcaracter_atual <> vcaracter_seguinte THEN
      SET vtexto_apoio = CONCAT(vtexto_apoio, vcaracter_atual);
    END IF;

    SET vposicao_atual = vposicao_atual + 1;
  END WHILE;

  SET vtexto = vtexto_apoio;

  -- Troca caracteres pelo som
  SET vposicao_atual = 1;
  SET vtexto_apoio = '';

  WHILE vposicao_atual <= LENGTH(vtexto) DO
    SET vcaracter_atual = SUBSTRING(vtexto, vposicao_atual, 1);

    IF vposicao_atual < LENGTH(vtexto) THEN
      SET vcaracter_seguinte = SUBSTRING(vtexto, vposicao_atual + 1, 1);
    ELSE
      SET vcaracter_seguinte = '';
    END IF;

    -- "B" seguindo de qualquer caracter que não seja "A", "E", "I", "O", "U", "R" ou "Y"
    IF vcaracter_atual = 'B' AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
        SET vsom = 'BI';
    -- "C" seguindo de "E", "I" ou "Y"
    ELSEIF vcaracter_atual = 'C' AND INSTR('EIY', vcaracter_seguinte) <> 0 THEN
      SET vsom = 'S';
    ELSEIF vcaracter_atual = 'C' THEN
      SET vsom = 'K';
    ELSEIF vcaracter_atual = 'D'  AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
      SET vsom = 'DI';
    ELSEIF vcaracter_atual = 'E' THEN
      SET vsom = 'I';
    ELSEIF vcaracter_atual = 'G' AND INSTR('EIY', vcaracter_seguinte) <> 0  THEN -- GE, GI OU GY
      SET vsom = 'J';
    ELSEIF vcaracter_atual = 'G' AND vcaracter_seguinte = 'T' THEN -- GT
      SET vsom = '';
    ELSEIF vcaracter_atual = 'H' THEN
      SET vsom = 'H';
    ELSEIF vcaracter_atual = 'N' THEN
      SET vsom = 'M';
    ELSEIF vcaracter_atual = 'P' AND INSTR('AEIOURY', vcaracter_seguinte) = 0 THEN
      SET vsom = 'PI';
    ELSEIF vcaracter_atual = 'Q' THEN
      SET vsom = 'K';
    -- QUA, QUE, QUI, QUO ou QUY
    ELSEIF IFNULL(vcaracter_anterior, '') = 'Q' AND vcaracter_atual = 'U' AND INSTR('AEIOY', vcaracter_seguinte) <> 0 THEN
      SET vsom = '';
    ELSEIF vcaracter_atual = 'W' THEN
      SET vsom = 'V';
    ELSEIF vcaracter_atual = 'X' THEN
      SET vsom = 'S';
    ELSEIF vcaracter_atual = 'Y' THEN
      SET vsom = 'I';
    ELSEIF vcaracter_atual = 'Z' THEN
      SET vsom = 'S';
    ELSE
      SET vsom = vcaracter_atual;
    END IF;

    SET vcaracter_anterior = vcaracter_atual;
    SET vposicao_atual = vposicao_atual + 1;
    SET vtexto_apoio = CONCAT(vtexto_apoio, vsom);
  END WHILE;

  SET vtexto = vtexto_apoio;

  SET vtexto = CONCAT('%', replace(vtexto, ' ', '%'), '%');

  RETURN vtexto;
END
$

The use is as follows::

SELECT *
  FROM publicacao
 WHERE transformar_fonetica(titulo) LIKE transformar_fonetica('$busca');

If there are many records to query will have performance problems, so I advise you to create a speaker just for phonetics and a trigger to update it.

Adding the new column

ALTER TABLE publicacao ADD fonetica TEXT;

Creation of Trigger for updating

DELIMITER $
CREATE TRIGGER publicacao_fonetica AFTER UPDATE ON publicacao
FOR EACH ROW
BEGIN
  if NEW.titulo <=> OLD.titulo THEN
    UPDATE publicacao
       SET fonetica = transformar_fonetica(NEW.titulo)
     WHERE id = OLD.id;
  END IF;
END;
$

Data selection

SELECT *
  FROM publicacao
 WHERE fonetica LIKE transformar_fonetica('$busca');

Observing:

The pronunciation of some letter combinations varies according to the words, so there may be discrepancies. Yet I haven’t really found an example where it doesn’t work. The search also works for first names.

4

For simple cases, the SOUNDEX() function can help

SELECT title, SOUNDEX(title) FROM soundex_test WHERE SOUNDEX(title) like SOUNDEX('%Farmácia%');

Of course you must make some adaptations because Pharmácia is P652 and Pharmacy is F652.

But it helps a lot in cases where phonetics may have some similarities. Example

SELECT title, SOUNDEX(title) FROM soundex_test WHERE SOUNDEX(title) like SOUNDEX('%drug%');

This will return records that contain terms like

drug
droug
droog

If you want to test, you can use these test tables:

CREATE TABLE `soundex_test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(45) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

INSERT INTO `soundex_test` VALUES ('1', 'Farmácia');
INSERT INTO `soundex_test` VALUES ('4', 'Pharmácia');
INSERT INTO `soundex_test` VALUES ('45', 'aucatra');
INSERT INTO `soundex_test` VALUES ('55', 'alcatara');
INSERT INTO `soundex_test` VALUES ('56', 'alctra');
INSERT INTO `soundex_test` VALUES ('71', 'Pharmacia');
INSERT INTO `soundex_test` VALUES ('72', 'Farmacia');
INSERT INTO `soundex_test` VALUES ('73', 'Memory Card');
INSERT INTO `soundex_test` VALUES ('79', 'Gamepad');
INSERT INTO `soundex_test` VALUES ('80', 'drug');
INSERT INTO `soundex_test` VALUES ('454', 'droug');
INSERT INTO `soundex_test` VALUES ('565', 'alcatra');
INSERT INTO `soundex_test` VALUES ('3445', 'droog');
INSERT INTO `soundex_test` VALUES ('5656', 'acatra');



SELECT title, SOUNDEX(title) FROM soundex_test WHERE SOUNDEX(title) like SOUNDEX('%alcatra%');

Returns:

alcatara
alctra
alcatra

Unfortunately does not return aucatra because the standard is English phonetics. To learn more about soundex: https://en.wikipedia.org/wiki/Soundex

Be aware that this is only an alternative and not a definitive solution.

Another solution is to create dictionary tables and then you manually register the variations. Of course this is colossal work and it takes years, but that’s basically what Google does. Another tip is to always log in what users type in the searches and you will notice certain patterns of typos, grammar errors, etc. Take advantage of these logs to compose the dictionary, relating them to the correct term or relating them to each other.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.