Difference between EXISTS and IN in postgres?

Asked

Viewed 1,454 times

1

I have a doubt that I have already researched here, but for me it made no sense the explanations.

I have this query:

SELECT * FROM a WHERE id IN (SELECT ID2 FROM b)

And this:

SELECT * FROM a WHERE EXIST (SELECT 1 FROM b WHERE a.ID = b.ID2).

What is the difference in the result of both? What can happen in one that does not happen in the other?

It can also be considered the NOT IN and the NOT EXISTS.

the only kind of answer I’ve seen is that one is more performative than the other, but that’s irrelevant to me.

  • I would say that the second option is more optimized, because I don’t need to assemble a collection to verify that the element is in it. The end result is the same, but the solution strategy may not be.

  • The alternative using IN may lead to an unexpected result if the subselect result may be NULL. As stated in the manual: "As usual, null values in the Rows are Combined per the normal Rules of SQL Boolean Expressions. Two Rows are considered Equal if all their corresponding Members are non-null and Equal; the Rows are unequal if any corresponding Members are non-null and unequal; otherwise the result of that Row comparison is Unknown (null). If all the per-Row Results are either unequal or null, with at least one null, then the result of IN is null."

1 answer

2

Assuming you have something like:

CREATE TABLE a ( id INTEGER, descricao TEXT );
CREATE TABLE b ( id2 INTEGER );

INSERT INTO a ( id, descricao ) VALUES ( 1, 'alpha' );
INSERT INTO a ( id, descricao ) VALUES ( 2, 'beta' );
INSERT INTO a ( id, descricao ) VALUES ( 3, 'gamma' );
INSERT INTO a ( id, descricao ) VALUES ( 4, 'delta' );

INSERT INTO b (id2) VALUES (1);
INSERT INTO b (id2) VALUES (1);
INSERT INTO b (id2) VALUES (4);
INSERT INTO b (id2) VALUES (4);

In theory, the first query seems to be more performative than the second, because the SELECT contained within the IN would run only once to get the universe of IDs contained in the table b, for example:

SELECT * FROM a WHERE id IN (SELECT ID2 FROM b)

Or:

SELECT * FROM a WHERE id IN (1,2,3,4)

The EXISTS of the second query depends on the column value ID table a to be executed, i.e., the SELECT on the table b would run several times, one for each record contained in the table a, look at you:

SELECT * FROM a WHERE EXIST (SELECT 1 FROM b WHERE a.ID = b.ID2)

Or:

SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 1 = b.ID2) -- TRUE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 2 = b.ID2) -- FALSE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 3 = b.ID2) -- FALSE
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE 4 = b.ID2) -- TRUE

In practice, the Planner/Optimizer ofPostgreSQL is clever enough to understand that both darlings produce identical results, and behind the scenes both queries would be executed in the same way.

This can be demonstrated with the command EXPLAIN:

First Query:

EXPLAIN SELECT * FROM a WHERE id IN (SELECT ID2 FROM b)

Exit:

Hash Join  (cost=44.50..83.71 rows=615 width=36)
  Hash Cond: (a.id = b.id2)
  ->  Seq Scan on a  (cost=0.00..22.30 rows=1230 width=36)
  ->  Hash  (cost=42.00..42.00 rows=200 width=4)
        ->  HashAggregate  (cost=40.00..42.00 rows=200 width=4)
              ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)

Second Query:

EXPLAIN SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE a.ID = b.ID2)

Exit:

Hash Join  (cost=44.50..83.71 rows=615 width=36)
  Hash Cond: (a.id = b.id2)
  ->  Seq Scan on a  (cost=0.00..22.30 rows=1230 width=36)
  ->  Hash  (cost=42.00..42.00 rows=200 width=4)
        ->  HashAggregate  (cost=40.00..42.00 rows=200 width=4)
              ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)

In practice, do not use intuition, use the EXPLAIN!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.