Query using ORDER BY is slow

I have a query like this:

select ae_id from app_event
join trip_item on ti_id = ae_trip_item_id
where ae_driver_id = 63 order by ae_date desc limit 10;

The table app_event has about 3 million rows, and the table trip_item has about 1 million.

The query returns about 40,000 rows, and I believe the problem is in sorting all this data, on top of the join between two large tables. I tried adding an index on ae_date (the ORDER BY column), but it did not help much. How can I make this query faster?

  • Have you run EXPLAIN ANALYZE to see the time estimates and why it’s taking so long?

  • I think you can add indexes to the columns used in the join as well. Run it with EXPLAIN first: explain select ae_id from app_event
join trip_item on ti_id = ae_trip_item_id
where ae_driver_id = 63 order by ae_date desc limit 10; and see at which stage the time is longest.

  • @Matheusmordorst I did not use ANALYZE, but I saw from the EXPLAIN output that the ORDER BY took the longest to run.

  • @Rovannlinhalis they already have indexes; they are all PKs and FKs (except ae_date, to which I added an index afterwards).

  • It is normal for an ORDER BY over this amount of data to take a while. Could you try query caching, or is that not feasible for you?

  • Have a look at this article: https://imasters.com.br/artigo/4406/postgresql/optimizando-bancos-postgresql-parte-01/?trace=1519021197&source=single and check out the work_mem option (sketched below, after these comments).

  • @Fabianolothor I have never used caching. I have several drivers who send events to the database every day, so the data in the tables keeps growing. And the people responsible for these drivers run these queries daily. I don’t know whether I could use a cache in this case.

  • Have you considered putting this query in a view, if you have permission to do so? (Also sketched below.)
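
A rough sketch of the work_mem and view suggestions above, assuming PostgreSQL; the 64MB value and the view name driver_events are illustrative, not taken from the thread:

-- Raise the session's sort memory; sorts that fit within work_mem
-- run in memory instead of spilling to disk.
SHOW work_mem;
SET work_mem = '64MB';

-- Wrap the join in a view so the daily reports can query it directly.
CREATE VIEW driver_events AS
SELECT ae_id, ae_date, ae_driver_id
FROM app_event
JOIN trip_item ON (ti_id = ae_trip_item_id);

-- Example usage:
SELECT ae_id FROM driver_events
WHERE ae_driver_id = 63
ORDER BY ae_date DESC
LIMIT 10;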

1 answer

First, a scenario similar to the one you described was created: an identical structure and a random mass of data:

--
-- app_event TABLE MODEL
--
create table app_event
(
  ae_id serial primary key,
  ae_driver_id integer,
  ae_trip_item_id integer,
  ae_date timestamp not null
);

--
-- trip_item TABLE MODEL
--
create table trip_item
(
  ti_id serial primary key
);

--
-- INSERT 3 MILLION RECORDS INTO app_event
--
INSERT INTO app_event( ae_id, ae_date, ae_driver_id, ae_trip_item_id )
SELECT
    generate_series( 1, 3000000 ),
    now() - ((random() * 3650) * '1 day'::interval),
    (random() * 100)::int,
    (random() * 100)::int;
--
-- INSERT 1 MILLION RECORDS INTO trip_item
--
INSERT INTO trip_item( ti_id )
SELECT
    generate_series( 1, 1000000 );
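
After bulk loads like these it can help to refresh the planner statistics, so that the cost estimates in the plans below reflect the actual data (a standard PostgreSQL step, not shown in the original answer):

--
-- REFRESH PLANNER STATISTICS
--
ANALYZE app_event;
ANALYZE trip_item;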

Query under analysis:

SELECT
    ae_id
FROM
    app_event
JOIN
    trip_item ON (ti_id = ae_trip_item_id)
WHERE
    ae_driver_id = 63
ORDER BY
    ae_date DESC
LIMIT
    10;
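
The plans below can be reproduced by prefixing the query with EXPLAIN, which prints the planner's estimates without executing the statement:

EXPLAIN
SELECT ae_id FROM app_event
JOIN trip_item ON (ti_id = ae_trip_item_id)
WHERE ae_driver_id = 63
ORDER BY ae_date DESC
LIMIT 10;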

Your query plan looks something like this:

Limit  (cost=87050.55..87050.58 rows=10 width=12)
  ->  Sort  (cost=87050.55..87089.49 rows=15574 width=12)
        Sort Key: app_event.ae_date
        ->  Hash Join  (cost=28320.00..86714.00 rows=15574 width=12)
              Hash Cond: (app_event.ae_trip_item_id = trip_item.ti_id)
              ->  Seq Scan on app_event  (cost=0.00..58043.59 rows=15574 width=16)
                    Filter: (ae_driver_id = 63)
              ->  Hash  (cost=15045.00..15045.00 rows=1062000 width=4)
                    ->  Seq Scan on trip_item  (cost=0.00..15045.00 rows=1062000 width=4)

Note that the query plan points to two areas of attention:

1) A Sequential Scan with quite a high cost on the table app_event, due to the filter on the ae_driver_id field;

2) Another Sequential Scan with quite a high cost on the table trip_item, due to the join condition using the field ae_trip_item_id.

A sequential scan over a table with a large volume of data tends to be slow for read operations, which is why creating indexes is crucial for good performance.

Creating Indexes:

CREATE INDEX idx_ae_trip_item_id ON app_event (ae_trip_item_id);
CREATE INDEX idx_ae_driver_id ON app_event (ae_driver_id);

New query plan:

Limit  (cost=24050.07..24050.09 rows=10 width=12)
  ->  Sort  (cost=24050.07..24120.51 rows=28178 width=12)
        Sort Key: app_event.ae_date
        ->  Merge Join  (cost=23021.55..23441.15 rows=28178 width=12)
              Merge Cond: (trip_item.ti_id = app_event.ae_trip_item_id)
              ->  Index Only Scan using trip_item_pkey on trip_item  (cost=0.42..30408.42 rows=1000000 width=4)
              ->  Sort  (cost=23014.70..23085.14 rows=28178 width=16)
                    Sort Key: app_event.ae_trip_item_id
                    ->  Bitmap Heap Scan on app_event  (cost=530.81..20932.02 rows=28178 width=16)
                          Recheck Cond: (ae_driver_id = 63)
                          ->  Bitmap Index Scan on idx_ae_driver_id  (cost=0.00..523.76 rows=28178 width=0)
                                Index Cond: (ae_driver_id = 63)

The effect of index creation is the total absence of Sequential Scans in the query plan and a drastic reduction in the cost of operations.

In practice, in my tests, the query took 450ms to execute before the indexes were created; after creating them, the running time is around 12ms.
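
Timings like these can be measured with EXPLAIN ANALYZE, which actually executes the query and reports the real execution time at the end of the plan (the numbers will vary with the random test data):

EXPLAIN ANALYZE
SELECT ae_id FROM app_event
JOIN trip_item ON (ti_id = ae_trip_item_id)
WHERE ae_driver_id = 63
ORDER BY ae_date DESC
LIMIT 10;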

I hope I’ve helped!
