To sort this array you need a similarity index that tells you that Guarulhos is the "best answer" of the three, something that has to be returned by the server that told you these are the most similar results...
Using pg_trgm
With Postgresql I would use the module pg_trgm, that returns you such index and already allows you to sort with ORDER BY
right in the query, saving any extra work in PHP. For example, I created an "airport" table with the three records you showed:
create table aeroportos (id int not null, iata text not null, full_name text not null);
insert into aeroportos values
(1059, 'BPA', 'Grumman, Bethpager, Bethpage, (BPA)'),
(3419, 'GUU', 'Bakki, Grundarfjorfur, Grundarfjordur, (GUU)'),
(8083, 'GRU', 'Guarulhos, São Paulo, (GRU)');
Here we can use the function word_similarity()
, that returns the degree of similarity between a string and the word most similar to it in another string. The higher the value in the column grau
(from 0 to 1), the greater the similarity between the strings.
Note that I concatenated the columns iata
and full_name
to compare with "GRU" as one thing.
select *, word_similarity(iata||full_name, 'GRU') as grau
from aeroportos
order by grau desc;
id | iata | full_name | grau
------+------+----------------------------------------------+----------
8083 | GRU | Guarulhos, São Paulo, (GRU) | 0.166667
3419 | GUU | Bakki, Grundarfjordur, Grundarfjordur, (GUU) | 0.125
1059 | BPA | Grumman, Bethpager, Bethpage, (BPA) | 0.047619
(3 rows)
Detailing a little more...
Since your table certainly has more than three records, it is interesting to use also similarity operators to filter only those that look enough with your search, otherwise every query will return the entire list of airports, mostly with similarity '0.0'. You can simply use LIMIT and cut to the amount of records that is most convenient for you, but you can also use other Postgresql features to filter by similarity, something that will give you much more performance for large volumes of data if added to the use of GIST indexes.
For this we can use the operator (which is also part of the pg_trgm package) <%
in the clause WHERE
. This operator returns true
if the first parameter is a string that contains a word similar to the second parameter, provided that the word similarity degree is greater than the minimum limit specified by the option pg_trgm.word_similarity_threshold
. As already noted by the previous query, the general degree of similarity of our comparisons is generally low, since the "right answer" for GRU has a score of only 0.16. So we should lower the minimum limiter so that our WHERE
does not return zero records. The smaller the pg_trgm.word_similarity_threshold
, more records will be returned:
set pg_trgm.word_similarity_threshold to 0.1;
select *, word_similarity(iata||full_name, 'GRU') as grau
from aeroportos
where (iata||full_name) <% 'GRU'
order by grau desc;
id | iata | full_name | grau
------+------+----------------------------------------------+----------
8083 | GRU | Guarulhos, São Paulo, (GRU) | 0.166667
3419 | GUU | Bakki, Grundarfjordur, Grundarfjordur, (GUU) | 0.125
(2 rows)
Set this whole course of action, in PHP the thing becomes simple. Just send the SET
for the minimum limit of similarity together with the query:
$query = "set pg_trgm.word_similarity_threshold to 0.1; select *, word_similarity(iata||full_name, 'GRU') as grau from aeroportos where (iata||full_name) <% 'GRU' order by grau desc;";
$res = pg_query($conn, $query);
Or, if you do not want to pass the limiter every time, you can even set it as property of the user of your application:
alter user nunks set pg_trgm.word_similarity_threshold = 0.1;
select usename, useconfig from pg_user where usename = 'nunks';
usename | useconfig
---------+-----------------------------------------
nunks | {pg_trgm.word_similarity_threshold=0.1}
(1 row)
Installing the pg_trgm extension
If you do not have pg_trgm installed, use the command CREATE EXTENSION
:
> create extension pg_trgm;
CREATE EXTENSION
If the command fails you probably need to install the extension pack postgresql-contrib, which can be either compiled or installed from the package manager of your choice. In Windows, the graphical installer gives you the option of which contrib modules to install.
Beyond the function word_similarity()
that we use here, the extension pg_trgm
offers other associated features such as GIN and GIST indexes capable of greatly speeding up the search response for similarity in large volumes of data, special operators and methods for fine-tuning the limit values of similarity.
In Oracle I use Jaro Winkler and/or EDIT_DISTANCE , for Postgreesql in Google I found this https://gist.github.com/fjavier/586c713943d76a023a70 I use to search strings by "proximity" see if it helps
– Motta
Very interesting this implementation, I had no idea how the algorithms of similarity between strings worked! Still, I suggest using pg_trgm as a "native" Postgres solution, which supports similarity indexing and so on, hehe...
– nunks