What’s the difference between UNION and UNION ALL?

Viewed 10,671 times

15

What’s the difference between UNION and UNION ALL? If possible, include examples of use.

2 answers

19


Basically, it is about duplicate rows. UNION discards rows from the other combined tables that already exist in the result of the queries applied to the previous tables. UNION ALL does not care about that and keeps every row.

For those who don’t know, UNION combines the data from one table with another linearly. Unlike a JOIN, which is done by relationship, it simply "sums" the rows of one table with those of the other(s). The columns of the tables involved must match (same number, with compatible types in the same positions). It is like placing one table under the other.

See the SQL Fiddle for UNION and the SQL Fiddle for UNION ALL.
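As a self-contained illustration of the duplicate handling described above, here is a minimal sketch using Python's standard-library sqlite3 module. The tables t1 and t2, their columns and the sample rows are all made up for the example; other engines should behave the same way for this case, but only SQLite is exercised here.

```python
import sqlite3

# In-memory database; t1/t2 and their contents are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (name TEXT, city TEXT);
    CREATE TABLE t2 (name TEXT, city TEXT);
    INSERT INTO t1 VALUES ('Ana', 'Lisboa'), ('Bruno', 'Porto');
    INSERT INTO t2 VALUES ('Bruno', 'Porto'), ('Carla', 'Faro');
""")

# UNION: the row ('Bruno', 'Porto') appears in both tables but is kept once.
union_rows = conn.execute(
    "SELECT name, city FROM t1 UNION SELECT name, city FROM t2"
).fetchall()

# UNION ALL: every row from both SELECTs is kept, duplicates included.
union_all_rows = conn.execute(
    "SELECT name, city FROM t1 UNION ALL SELECT name, city FROM t2"
).fetchall()

print(len(union_rows), sorted(union_rows))
print(len(union_all_rows), sorted(union_all_rows))
```

Both SELECTs list the same number of columns in the same positions, which is the compatibility requirement mentioned above.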

Some cases get complicated. If, for example, the SELECT includes a UNIQUE column (and that includes the PRIMARY KEY), rows that otherwise look identical still differ in that column, so UNION may not give the expected result. In these cases such a column should not be part of the SELECT. See it on SQL Fiddle.
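One reading of this caveat can be sketched as follows, again with Python's sqlite3 module. The tables a and b, the id values and the names are hypothetical; the point is only that a key column makes otherwise equal rows distinct.

```python
import sqlite3

# Hypothetical tables: the same person ('Bruno') exists in both,
# but under a different primary key in each table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO a (id, name) VALUES (1, 'Ana'), (2, 'Bruno');
    INSERT INTO b (id, name) VALUES (7, 'Bruno'), (8, 'Carla');
""")

# With the key in the SELECT, (2, 'Bruno') and (7, 'Bruno') are different
# tuples, so UNION has nothing to eliminate.
with_key = conn.execute(
    "SELECT id, name FROM a UNION SELECT id, name FROM b"
).fetchall()

# Without the key, the two 'Bruno' rows are equal and collapse into one.
without_key = conn.execute(
    "SELECT name FROM a UNION SELECT name FROM b"
).fetchall()

print(len(with_key), len(without_key))
```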

I put it on GitHub for future reference.

  • 1

    I think it’s worth mentioning that, because it does not care about identical rows, UNION ALL is faster. Also, at least on the SQL Server I tested, UNION ALL starts returning partial results sooner (and therefore has greater availability).

11

An alternative way of seeing UNION and UNION ALL comes directly from mathematics:

  • UNION is an operation on collections of elements that results in a set;
  • UNION ALL is an operation on collections of elements that results in a bag.

Here, both sets and bags are collections of elements. The difference between them lies in the operation of adding an element to a pre-existing collection. I will call it "sum":

el + C = R

Where el is any element, C is the pre-existing collection and R is the collection resulting from the operation, which contains all of C and also has el as an element.

If el did not previously exist in C, then the operation is identical for a set and for a bag. Now, if el already existed in C, the sum with a set results in R == C, so the result is unaffected. The bag, however, is changed by the addition of el, therefore R != C.

In a way, we can say that a bag is a collection of elements that admits repetition, whereas a set does not.
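The "sum" defined above can be sketched in Python, taking the built-in set as the set and collections.Counter as a stand-in for the bag. The function names and sample elements are made up for the illustration.

```python
from collections import Counter


def add_to_set(el, c):
    """el + C for a set: adding an element that is already present changes nothing."""
    return c | {el}


def add_to_bag(el, c):
    """el + C for a bag: every addition increases the multiplicity of el."""
    return c + Counter([el])


C_set = {"a", "b"}
C_bag = Counter(["a", "b"])

# el not in C: both collections gain the new element.
print(add_to_set("z", C_set), add_to_bag("z", C_bag))

# el already in C: the set is unchanged (R == C), the bag is not (R != C).
print(add_to_set("a", C_set) == C_set)
print(add_to_bag("a", C_bag) == C_bag)
```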

With this, we get interesting results when using UNION and UNION ALL. Since a set does not admit repetition, UNION compares all the tuples and returns only the unique ones. For performance, the first operation performed is a total ordering of the tuples (time O(n log n)), followed by the elimination of repetitions (time O(n)). If it did not sort before checking for uniqueness, it would have a quadratic execution time. I deal with exactly this problem in another answer, where I explain where these orders of complexity come from.

In that answer I also show that every tuple composed of orderable elements can itself be ordered. Normally one works with numbers, strings, and dates in a database, so in this universe an ordering can always be obtained. One could also use a heuristic for ordering blobs, treating them as a word of bytes and ordering them lexicographically, which keeps a more "natural" ordering. Enumerations have string labels, so we could use those labels and still sort the set (although this is no longer a natural ordering).
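The sort-then-scan strategy described above, and the quadratic alternative it avoids, can be sketched in Python. This is an illustration of the idea only, not how any particular DBMS implements UNION; the function names and sample rows are made up.

```python
def dedup_sorted(rows):
    """Sort once, O(n log n), then drop adjacent repeats in one O(n) pass."""
    unique = []
    for row in sorted(rows):
        # After sorting, equal tuples are neighbours, so comparing with the
        # last kept tuple is enough.
        if not unique or unique[-1] != row:
            unique.append(row)
    return unique


def dedup_quadratic(rows):
    """No sort: each row is compared against everything kept so far, O(n^2)."""
    unique = []
    for row in rows:
        if row not in unique:  # linear scan of a list
            unique.append(row)
    return unique


# Tuples of orderable elements (numbers, strings) are ordered lexicographically.
rows = [(2, "b"), (1, "a"), (2, "b"), (1, "c"), (1, "a")]
print(dedup_sorted(rows))
print(dedup_quadratic(rows))
```

Both functions return the same unique tuples; only the amount of comparison work differs as the input grows.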

So, for performance reasons, when a UNION is requested, the DBMS will normally store the entire result of the query, run a single sort at the very end, and only then extract the unique tuples. It does not sort the dataset partially because that is extremely bad for performance: sorting once every m new rows means running an O(n log n) sort O(n/m) times, which can end up even worse than a quadratic sort if m is poorly chosen.
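A rough worked bound for that argument, assuming the k-th partial sort runs over the km rows received so far and that m ≤ n:

```latex
\sum_{k=1}^{n/m} O\bigl(km \log(km)\bigr)
  \;\le\; O\Bigl(m \log n \sum_{k=1}^{n/m} k\Bigr)
  \;=\; O\Bigl(\frac{n^{2}}{m} \log n\Bigr)
```

With m = 1 (re-sorting after every row) this becomes O(n² log n), which is worse than quadratic; with m = n it reduces to the single O(n log n) sort at the end.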

All of this implies that UNION does not offer high availability, because it only starts returning rows after it has obtained all the data.

UNION ALL, on the other hand, produces a bag, so it does not need to obtain the entire result before returning it. As soon as a row is obtained, it can be sent immediately to whoever made the query; the engine can then forget that row and fetch the next one. That makes its availability much greater. Not to mention that, depending on how the SQL engine was implemented, the result of this operator may not even need to be stored in memory, since rows can be returned to the caller as soon as they are obtained.
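The availability difference can be sketched with Python generators. This models the behaviour described above rather than any real engine: source, union_all and union are hypothetical helpers, and the log lists only record how many rows had to be read before the first row could be returned.

```python
from itertools import chain


def source(rows, log):
    """Simulate a query: record each row in `log` at the moment it is produced."""
    for row in rows:
        log.append(row)
        yield row


def union_all(*queries):
    """Bag semantics: pass each row through as soon as it is produced."""
    return chain.from_iterable(queries)


def union(*queries):
    """Set semantics: materialize and sort everything before yielding anything."""
    ordered = sorted(chain.from_iterable(queries))
    previous = object()  # sentinel that compares unequal to every row
    for row in ordered:
        if row != previous:
            yield row
        previous = row


log_all, log_union = [], []

first_all = next(union_all(source([(3,), (1,)], log_all), source([(1,), (2,)], log_all)))
first_union = next(union(source([(3,), (1,)], log_union), source([(1,), (2,)], log_union)))

print(first_all, len(log_all))      # rows read before the first answer
print(first_union, len(log_union))  # rows read before the first answer
```

In this model, union_all hands back its first row after reading a single input row and never holds more than one row at a time, while union has to read all four input rows before it can answer.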

Perhaps you need to work with proper sets, not bags, but that alone does not mean you need UNION. Of course, this depends heavily on the semantics of each case, so I do not recommend generalizing. I will give an example where a set can be obtained using UNION ALL.

I have a model roughly like this:

[relationship via multiplexing, where one table hypothetically points to 3 others through a foreign key multiplexed by the value of another column]

I need to retrieve the name and code of every account holder, as well as whether the holder is a "supervisor", a "seller" or a "customer". The foreign key in conta_corrente (current account) is cd_usuario, which in turn connects to cd_cliente, cd_vendedor or cd_supervisor, depending on the value of tp_conta. In my case, each account holder can have at most one current account. The query would look like this:

SELECT cd_usuario, 'cliente' AS tp_correntista, nm_cliente AS nm_correntista
FROM conta_corrente cc INNER JOIN
    cliente c ON (c.cd_cliente = cc.cd_usuario)
WHERE cc.tp_conta = 'c'
UNION
SELECT cd_usuario, 'vendedor' AS tp_correntista, nm_vendedor AS nm_correntista
FROM conta_corrente cc INNER JOIN
    vendedor v ON (v.cd_vendedor = cc.cd_usuario)
WHERE cc.tp_conta = 'v'
UNION
SELECT cd_usuario, 'supervisor' AS tp_correntista, nm_supervisor AS nm_correntista
FROM conta_corrente cc INNER JOIN
    supervisor s ON (s.cd_supervisor = cc.cd_usuario)
WHERE cc.tp_conta = 's'

Done: the query returns a set, as expected. Now, did you notice that a tuple from, say, the first query can never be equal to a tuple from the second? That is because every tuple from the first query has the value "cliente" as its second element, while every tuple from the second query has "vendedor" in the same position. In addition, since cd_cliente is the primary key of the table cliente and each cliente in this model is linked to at most one row of conta_corrente, there is no collision of tuples within each individual query either, so each of the 3 queries above already results in a set.

Since we already have 3 sets, and we are assured that none of them has an element in common with another, the "sum" of sets gives the same end result as the "sum" of bags. Therefore, in such cases, UNION ALL guarantees the desired result and also gives better performance (theoretically, at least).

The query can then be rewritten like this:

SELECT cd_usuario, 'cliente' AS tp_correntista, nm_cliente AS nm_correntista
FROM conta_corrente cc INNER JOIN
    cliente c ON (c.cd_cliente = cc.cd_usuario)
WHERE cc.tp_conta = 'c'
UNION ALL
SELECT cd_usuario, 'vendedor' AS tp_correntista, nm_vendedor AS nm_correntista
FROM conta_corrente cc INNER JOIN
    vendedor v ON (v.cd_vendedor = cc.cd_usuario)
WHERE cc.tp_conta = 'v'
UNION ALL
SELECT cd_usuario, 'supervisor' AS tp_correntista, nm_supervisor AS nm_correntista
FROM conta_corrente cc INNER JOIN
    supervisor s ON (s.cd_supervisor = cc.cd_usuario)
WHERE cc.tp_conta = 's'
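To check the equivalence on a small dataset, here is a sketch using Python's sqlite3 module. The table and column names come from the queries above, but the column types, the composite primary key on conta_corrente, the sample rows and the build_query helper are assumptions made for the example, since the original model is only shown as a diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: types, keys and sample names are hypothetical.
conn.executescript("""
    CREATE TABLE cliente    (cd_cliente    INTEGER PRIMARY KEY, nm_cliente    TEXT);
    CREATE TABLE vendedor   (cd_vendedor   INTEGER PRIMARY KEY, nm_vendedor   TEXT);
    CREATE TABLE supervisor (cd_supervisor INTEGER PRIMARY KEY, nm_supervisor TEXT);
    CREATE TABLE conta_corrente (
        cd_usuario INTEGER NOT NULL,
        tp_conta   TEXT    NOT NULL,
        PRIMARY KEY (cd_usuario, tp_conta)  -- at most one account per holder
    );
    INSERT INTO cliente    VALUES (1, 'Ana'), (2, 'Bia');
    INSERT INTO vendedor   VALUES (1, 'Caio');
    INSERT INTO supervisor VALUES (1, 'Duda');
    INSERT INTO conta_corrente VALUES (1, 'c'), (2, 'c'), (1, 'v'), (1, 's');
""")

BRANCHES = [("cliente", "c"), ("vendedor", "v"), ("supervisor", "s")]


def build_query(operator):
    """Join the three branch SELECTs with the given operator (UNION or UNION ALL)."""
    parts = [
        f"SELECT cd_usuario, '{tabela}' AS tp_correntista, "
        f"nm_{tabela} AS nm_correntista "
        f"FROM conta_corrente cc INNER JOIN {tabela} t "
        f"ON (t.cd_{tabela} = cc.cd_usuario) "
        f"WHERE cc.tp_conta = '{tipo}'"
        for tabela, tipo in BRANCHES
    ]
    return f" {operator} ".join(parts)


union_rows = conn.execute(build_query("UNION")).fetchall()
union_all_rows = conn.execute(build_query("UNION ALL")).fetchall()

# Same rows either way, and UNION ALL produced no repeated tuple.
print(sorted(union_rows) == sorted(union_all_rows))
print(len(union_all_rows) == len(set(union_all_rows)))
```

Row order is not guaranteed by either operator, so the comparison sorts both results first.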
