How to sort the results more similar to the parameter passed?

Asked

Viewed 101 times

1

I have a table in my system that stores all airports and in a select autocomplete i have this list where I can search.

The problem is that I can search for "nome", "cidade", "país" and "IATA" and accurate retornar em ordem de mais similar the airports found.

For example:

my parameter is "GRU" and the function returns me this:

{
    "airports": [
        {
            "id": 1059,
            "iata": "BPA",
            "full_name": "Grumman, Bethpage, Bethpage, (BPA)"
        },
        {
            "id": 3410,
            "iata": "GUU",
            "full_name": "Bakki, Grundarfjordur, Grundarfjordur, (GUU)"
        },
        {
            "id": 8083,
            "iata": "GRU",
            "full_name": "Guarulhos, São Paulo, São Paulo, (GRU)"
        }
    ],
}

As you can see, the return is right, but I, as a user, researched the IATA and the function returned me everything that was similar to what I researched.

The code is as follows::

$str_where = "CITEXT(
                            TRANSLATE(
                                TRIM(
                                    name || ', ' || city || ', ' || country || ', ' || '(' || iata || ')' 
                                ) , 
                                'áàâãäåaaaÁÂÃÄÅAAAÀéèêëeeeeeEEEÉEEÈìíîïìiiiÌÍÎÏÌIIIóôõöoooòÒÓÔÕÖOOOùúûüuuuuÙÚÛÜUUUUçÇñÑýÝ',
                                'aaaaaaaaaAAAAAAAAAeeeeeeeeeEEEEEEEiiiiiiiiIIIIIIIIooooooooOOOOOOOOuuuuuuuuUUUUUUUUcCnNyY'
                            )
                        )";

            $airports = $connection
                    ->newQuery()
                    ->select(['id', 'iata'])
                    ->select(["CITEXT(name || ', ' || city || ', ' || country || ', ' || '(' || iata || ')' ) AS full_name"])
                    ->from('white_label_airports')
                    ->where([$str_where . ' LIKE' => '%' . \Cake\Utility\Text::transliterate($this->request->data('term')) . '%'])
                    ->limit(20)
                    ->execute()
                    ->fetchAll('assoc');

I wonder if I have how to sort by the most similar of the server response and if anyone has any better idea.

Thank you in advance!

  • 1

    In Oracle I use Jaro Winkler and/or EDIT_DISTANCE , for Postgreesql in Google I found this https://gist.github.com/fjavier/586c713943d76a023a70 I use to search strings by "proximity" see if it helps

  • Very interesting this implementation, I had no idea how the algorithms of similarity between strings worked! Still, I suggest using pg_trgm as a "native" Postgres solution, which supports similarity indexing and so on, hehe...

1 answer

0

To sort this array you need a similarity index that tells you that Guarulhos is the "best answer" of the three, something that has to be returned by the server that told you these are the most similar results...

Using pg_trgm

With Postgresql I would use the module pg_trgm, that returns you such index and already allows you to sort with ORDER BY right in the query, saving any extra work in PHP. For example, I created an "airport" table with the three records you showed:

create table aeroportos (id int not null, iata text not null, full_name text not null);
insert into aeroportos values 
(1059, 'BPA', 'Grumman, Bethpager, Bethpage, (BPA)'),
(3419, 'GUU', 'Bakki, Grundarfjorfur, Grundarfjordur, (GUU)'),
(8083, 'GRU', 'Guarulhos, São Paulo, (GRU)');

Here we can use the function word_similarity(), that returns the degree of similarity between a string and the word most similar to it in another string. The higher the value in the column grau (from 0 to 1), the greater the similarity between the strings. Note that I concatenated the columns iata and full_name to compare with "GRU" as one thing.

select *, word_similarity(iata||full_name, 'GRU') as grau 
from aeroportos 
order by grau desc;

  id  | iata |                  full_name                   |   grau   
------+------+----------------------------------------------+----------
 8083 | GRU  | Guarulhos, São Paulo, (GRU)                  | 0.166667
 3419 | GUU  | Bakki, Grundarfjordur, Grundarfjordur, (GUU) |    0.125
 1059 | BPA  | Grumman, Bethpager, Bethpage, (BPA)          | 0.047619
(3 rows)

Detailing a little more...

Since your table certainly has more than three records, it is interesting to use also similarity operators to filter only those that look enough with your search, otherwise every query will return the entire list of airports, mostly with similarity '0.0'. You can simply use LIMIT and cut to the amount of records that is most convenient for you, but you can also use other Postgresql features to filter by similarity, something that will give you much more performance for large volumes of data if added to the use of GIST indexes.

For this we can use the operator (which is also part of the pg_trgm package) <% in the clause WHERE. This operator returns true if the first parameter is a string that contains a word similar to the second parameter, provided that the word similarity degree is greater than the minimum limit specified by the option pg_trgm.word_similarity_threshold. As already noted by the previous query, the general degree of similarity of our comparisons is generally low, since the "right answer" for GRU has a score of only 0.16. So we should lower the minimum limiter so that our WHERE does not return zero records. The smaller the pg_trgm.word_similarity_threshold, more records will be returned:

set pg_trgm.word_similarity_threshold to 0.1;

select *, word_similarity(iata||full_name, 'GRU') as grau 
from aeroportos 
where (iata||full_name) <% 'GRU' 
order by grau desc;

  id  | iata |                  full_name                   |   grau   
------+------+----------------------------------------------+----------
 8083 | GRU  | Guarulhos, São Paulo, (GRU)                  | 0.166667
 3419 | GUU  | Bakki, Grundarfjordur, Grundarfjordur, (GUU) |    0.125
(2 rows)

Set this whole course of action, in PHP the thing becomes simple. Just send the SET for the minimum limit of similarity together with the query:

$query = "set pg_trgm.word_similarity_threshold to 0.1; select *, word_similarity(iata||full_name, 'GRU') as grau from aeroportos where (iata||full_name) <% 'GRU' order by grau desc;";
$res = pg_query($conn, $query);

Or, if you do not want to pass the limiter every time, you can even set it as property of the user of your application:

alter user nunks set pg_trgm.word_similarity_threshold = 0.1;
select usename, useconfig from pg_user where usename = 'nunks';
 usename |                useconfig                
---------+-----------------------------------------
 nunks   | {pg_trgm.word_similarity_threshold=0.1}
(1 row)

Installing the pg_trgm extension

If you do not have pg_trgm installed, use the command CREATE EXTENSION:

> create extension pg_trgm;
CREATE EXTENSION

If the command fails you probably need to install the extension pack postgresql-contrib, which can be either compiled or installed from the package manager of your choice. In Windows, the graphical installer gives you the option of which contrib modules to install.

Beyond the function word_similarity() that we use here, the extension pg_trgm offers other associated features such as GIN and GIST indexes capable of greatly speeding up the search response for similarity in large volumes of data, special operators and methods for fine-tuning the limit values of similarity.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.