Why are cost estimates in PostgreSQL so wrong?

Could someone help me understand why PostgreSQL missed several cost estimates in an experiment I did?

I'm running an experiment with the 22 queries of the TPC-H benchmark [1] to verify index performance.

Of the 22 queries, the optimizer used secondary indexes in only 5. In a second experiment, those 5 queries were executed on a database without any indexes, to see whether execution time would increase in their absence.

But to my surprise, the runs without indexes were faster than the runs that used the indexes.

I would like to understand why the estimated total cost was wrong in every case: in all 5 cases, the plan with the lower estimated cost was the one that took the most time, which seems incorrect to me.

See below: the first line refers to query 6 using an index, with a cost of 3335809, lower than the cost of 5255959 for the same query 6 without the index.

Also look at the time spent: with the index the query took almost 8 minutes, while without the index it took 55 seconds. The same behavior appeared in the other cases.

My question is: why does the total cost in the execution plan rank the plans incorrectly in every case where indexes are present?

Indexes  |Query   |Time_Spent      |Total_Cost
===============================================
Yes       6        00:07:56         3335809.61
No        6        00:00:55         5255959.00
Yes       7        00:09:16         5847359.97
No        7        00:02:08         6793148.45
Yes       10       00:07:04         40743017.17
No        10       00:02:14         41341406.62
Yes       15       00:10:03         6431359.90
No        15       00:01:56         9608659.87
Yes       20       00:12:48         8412159.69
No        20       00:05:49         9537835.93
===============================================
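The pattern can be checked mechanically; a minimal sketch, using only the numbers copied from the table above:

```python
# Rows copied from the table above:
# (query, indexed_seconds, indexed_cost, noindex_seconds, noindex_cost)
rows = [
    (6,  7*60+56,  3335809.61,  55,      5255959.00),
    (7,  9*60+16,  5847359.97,  2*60+8,  6793148.45),
    (10, 7*60+4,   40743017.17, 2*60+14, 41341406.62),
    (15, 10*60+3,  6431359.90,  1*60+56, 9608659.87),
    (20, 12*60+48, 8412159.69,  5*60+49, 9537835.93),
]

for q, t_idx, c_idx, t_seq, c_seq in rows:
    # The optimizer chose the indexed plan because its estimated cost is lower...
    assert c_idx < c_seq
    # ...yet the indexed plan was slower in every single case.
    assert t_idx > t_seq
    print(f"query {q}: cost ratio {c_idx / c_seq:.2f}, "
          f"time ratio {t_idx / t_seq:.1f}x slower with index")
```

In all 5 cases the cost ratio is below 1.0 while the time ratio is well above 1.0, so the estimated ordering and the observed ordering disagree consistently.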

Because of a patch specific to my research project, I am using an old version, 9.0.1 (from 2012).

I didn't modify the default parameters except random_page_cost, which I set to 1 because I am using an SSD, where random access is cheaper than on an HDD. Here is what is in my postgresql.conf:

max_connections = 100 
effective_io_concurrency = 5 
#seq_page_cost = 1.0                    
random_page_cost = 1.0                  
#cpu_tuple_cost = 0.01                  
#cpu_index_tuple_cost = 0.005        
#cpu_operator_cost = 0.0025             
#effective_cache_size = 128MB

Could any of these parameters have caused this error in the cost estimates?
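To see how random_page_cost feeds into the estimates, here is a rough sketch of the planner's arithmetic. These formulas are simplified versions of PostgreSQL's cost model (the real planner adds more terms), and the page/row counts below are hypothetical, not taken from my database:

```python
# Planner cost constants, matching the postgresql.conf above.
seq_page_cost = 1.0
random_page_cost = 1.0   # my non-default setting; the default is 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005

def seq_scan_cost(pages, rows):
    # Simplified: every page read sequentially, every row processed.
    return seq_page_cost * pages + cpu_tuple_cost * rows

def index_fetch_cost(heap_pages_touched, index_rows, rpc=random_page_cost):
    # Simplified: heap pages fetched in "random" order plus
    # per-index-tuple CPU cost.
    return rpc * heap_pages_touched + cpu_index_tuple_cost * index_rows

# Hypothetical table: 1,000,000 pages, 60,000,000 rows; the predicate
# matches 30% of the rows, scattered so most heap pages are touched anyway.
pages, rows = 1_000_000, 60_000_000
matching = int(rows * 0.30)
heap_pages = 900_000  # a wide date-range match visits nearly every page

print("seq scan          :", seq_scan_cost(pages, rows))
print("index, rpc = 1.0  :", index_fetch_cost(heap_pages, matching))
print("index, rpc = 4.0  :", index_fetch_cost(heap_pages, matching, rpc=4.0))
```

With random_page_cost at the default 4.0 the index path is priced four times higher per page and loses clearly; at 1.0 the two plans look comparable on paper, which makes the planner much more willing to choose the index plan even when the sequential scan is faster in practice.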

Below are query 6 and its EXPLAIN ANALYZE output with and without indexes.

------------- QUERY 6 ---------------------------------------------
select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
    l_shipdate >= date '1995-01-01'
    and l_shipdate < date '1995-01-01' + interval '1' year
    and l_discount between 0.09 - 0.01 and 0.09 + 0.01
    and l_quantity < 24;

---- WITH INDEX (idx_l_shipdatelineitem000) -------------------
Execution plan:
Aggregate  (cost=3335809.59..3335809.61 rows=1 width=16) (actual time=476033.847..476033.847 rows=1 loops=1)
  ->  Bitmap Heap Scan on lineitem  (cost=376416.20..3330284.29 rows=2210122 width=16) (actual time=375293.183..471695.391 rows=2282333 loops=1)
        Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone))
        Filter: ((l_discount >= 0.08) AND (l_discount <= 0.10) AND (l_quantity < 24::numeric))
        ->  Bitmap Index Scan on idx_l_shipdatelineitem000  (cost=0.00..375863.67 rows=17925026 width=0) (actual time=375289.456..375289.456 rows=18211743 loops=1)
              Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone))
Total runtime: 476034.306 ms


------------------ WITHOUT INDEX ----------------------------
Execution plan:
Aggregate  (cost=5255958.99..5255959.00 rows=1 width=16) (actual time=55051.051..55051.051 rows=1 loops=1)
  ->  Seq Scan on lineitem  (cost=0.00..5250433.68 rows=2210122 width=16) (actual time=0.394..52236.276 rows=2282333 loops=1)
        Filter: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.08) AND (l_discount <= 0.10) AND (l_quantity < 24::numeric))
Total runtime: 55051.380 ms

Any hint is welcome!

  • What is the structure of the index? From the EXPLAIN output, the index apparently covers only the column l_shipdate, but the filter conditions still require access to the row values. Is there a difference in timing if you create a composite index on [l_shipdate, l_quantity, l_discount]?

  • Which index type are you using for these fields? Do all the fields in your WHERE clause have indexes? What is the volume of data in your tables? How about a practical example on SQLFiddle?

  • To answer this question we would need the data contained in the tables. PostgreSQL plans queries based on statistics about the data (which is also why you need to update those statistics periodically, so they always reflect the current state of the database).

No answers
