Assuming you have something like:
CREATE TABLE a ( id INTEGER, descricao TEXT );
CREATE TABLE b ( id2 INTEGER );
INSERT INTO a ( id, descricao ) VALUES ( 1, 'alpha' );
INSERT INTO a ( id, descricao ) VALUES ( 2, 'beta' );
INSERT INTO a ( id, descricao ) VALUES ( 3, 'gamma' );
INSERT INTO a ( id, descricao ) VALUES ( 4, 'delta' );
INSERT INTO b (id2) VALUES (1);
INSERT INTO b (id2) VALUES (1);
INSERT INTO b (id2) VALUES (4);
INSERT INTO b (id2) VALUES (4);
In theory, the first query seems to be more performative than the second, because the SELECT
contained within the IN
would run only once to get the universe of IDs
contained in the table b
, for example:
SELECT * FROM a WHERE id IN (SELECT ID2 FROM b)
Or:
SELECT * FROM a WHERE id IN (1,2,3,4)
The EXISTS
of the second query depends on the column value ID
table a
to be executed, i.e., the SELECT
on the table b
would run several times, one for each record contained in the table a
, look at you:
SELECT * FROM a WHERE EXIST (SELECT 1 FROM b WHERE a.ID = b.ID2)
Or:
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 1 = b.ID2) -- TRUE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 2 = b.ID2) -- FALSE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 3 = b.ID2) -- FALSE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 4 = b.ID2) -- TRUE
In practice, the Planner/Optimizer ofPostgreSQL
is clever enough to understand that both darlings produce identical results, and behind the scenes both queries would be executed in the same way.
This can be demonstrated with the command EXPLAIN
:
First Query:
EXPLAIN SELECT * FROM a WHERE id IN (SELECT ID2 FROM b)
Exit:
Hash Join (cost=44.50..83.71 rows=615 width=36)
Hash Cond: (a.id = b.id2)
-> Seq Scan on a (cost=0.00..22.30 rows=1230 width=36)
-> Hash (cost=42.00..42.00 rows=200 width=4)
-> HashAggregate (cost=40.00..42.00 rows=200 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)
Second Query:
EXPLAIN SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE a.ID = b.ID2)
Exit:
Hash Join (cost=44.50..83.71 rows=615 width=36)
Hash Cond: (a.id = b.id2)
-> Seq Scan on a (cost=0.00..22.30 rows=1230 width=36)
-> Hash (cost=42.00..42.00 rows=200 width=4)
-> HashAggregate (cost=40.00..42.00 rows=200 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)
In practice, do not use intuition, use the EXPLAIN
!
I would say that the second option is more optimized, because I don’t need to assemble a collection to verify that the element is in it. The end result is the same, but the solution strategy may not be.
– Jefferson Quesado
The alternative using IN may lead to an unexpected result if the subselect result may be NULL. As stated in the manual: "As usual, null values in the Rows are Combined per the normal Rules of SQL Boolean Expressions. Two Rows are considered Equal if all their corresponding Members are non-null and Equal; the Rows are unequal if any corresponding Members are non-null and unequal; otherwise the result of that Row comparison is Unknown (null). If all the per-Row Results are either unequal or null, with at least one null, then the result of IN is null."
– anonimo