1. Introduction
Clarify the differences and applications of the two statements. The reference is the database manager SQL Server
.
2. What appears in the documentation?
2.1 DISTINCT
SELECT [ ALL | DISTINCT ]
[ TOP ( expression ) [ PERCENT ] [ WITH TIES ] ]
<select_list>
ALL
Especifica que linhas duplicadas podem aparecer no conjunto de resultados.
ALL é o padrão.
DISTINCT
Especifica que só linhas exclusivas podem aparecer no conjunto de resultados.
Valores nulos são considerados iguais para os propósitos
da palavra-chave DISTINCT.
The pair of brackets on the stretch [ ALL | DISTINCT ]
indicates that the two arguments are optional and that if informed are mutually exclusive, that is, either one or the other. As stated in the documentation the argument ALL
is the default, that is, if none of the arguments are in the command, the presence of ALL
. Regarding the argument DISTINCT
the documentation quotes exclusive lines, is understood as non-repeating values.
The definition of <select_list>
is extensive but, for the purpose of this article, only the presence of column names will be considered:
SELECT coluna_1, coluna_2, ..., coluna_n
from tabela;
2.2 GROUP BY
GROUP BY {
<column-expression>
| ROLLUP ( <group_by_expression> [ ,...n ] )
| CUBE ( <group_by_expression> [ ,...n ] )
| GROUPING SETS ( <grouping_set> [ ,...n ] )
| ()
} [ ,...n ]
Groups a set of selected rows into a set of summary rows by the values of one or more columns or expressions. A row is returned to each group. The aggregation functions in the list of <seleção>
of the SELECT clause provide information on each group instead of individual lines.
The purpose of this clause is to group rows where there are same values for the columns defined in the clause, generating subsets. For each subset, aggregation functions can then be performed in the remaining columns. At the end, for each subset a single row is returned, containing the grouping columns and the results of the aggregation functions.
For the definition of <column_expression>
we will consider column names only.
2.3 Functions of aggregation
In the documentation of GROUP BY
appears aggregation functions.
When consulting the documentation regarding aggregation functions, we have
Aggregation functions perform a calculation on a set of values
and return a single value. Aggregation functions are usually
used with the clause GROUP BY
of the SELECT instruction.
In a simple template, the aggregation functions listed in the SELECT clause are executed for each subset generated by the GROUP BY clause.
As examples of aggregation function we have:
COUNT: Retorna o número de itens em um grupo.
AVG: Retorna a média dos valores em um grupo.
SUM: Retorna a soma de uma expressão numérica avaliada em um conjunto especificado.
3. Demonstration of the use of resources
To demonstrate the application of DISTINCT and GROUP BY, we will use the following table:
-- código #1
CREATE TABLE Vendas (
NomeVendedor varchar(30),
ProdutoVendido varchar(50),
QuantidadeVendida integer,
ValorVenda money
);
INSERT
-- código #2
INSERT into VENDAS values
('João', 'Macarrão', 18, 35.00),
('Maria', 'Beterraba', 3, 12.00),
('José', 'Cenoura', 5, 5.00),
('João', 'Molho de tomate', 1, 7.50),
('Antônio', 'Beterraba', 4, 16.00),
('João', 'Macarrão', 3, 4.20);
And you need to generate the following reports:
- What are the sellers?
- Which products each seller sold?
- Total sales, in real, of each seller?
- How many items sold per product per seller?
- How many different products were sold by each seller?
3.1 What are the sellers?
-- código #3
SELECT NomeVendedor
from VENDAS;
João
Maria
José
João
Antônio
João
However, you realize that the name John appears 3 times. How to eliminate repetitions? It is a typical application of using DISTINCT
!
-- código #3a
SELECT DISTINCT NomeVendedor
from VENDAS;
Antônio
João
José
Maria
When reviewing the code implementation plans #3
and #3a
, the difference can easily be seen: The presence of the logical operator DISTINCT SORT
in the Code Implementation Plan #3a
.
Sqlfiddle
3.2 What products each seller sold?
This request involves the use of two columns: NomeVendedor
and ProdutoVendido
.
-- código #5
SELECT NomeVendedor, ProdutoVendido
from VENDAS;
The result is that the pair {João, Macarrão}
appears more than once. Here’s another typical application of using DISTINCT
, but now acting on two columns.
-- código #5a
SELECT DISTINCT NomeVendedor, ProdutoVendido
from VENDAS;
Important: DISTINCT
acts simultaneously in the columns NomeVendedor
and ProdutoVendido
. Consider the two columns to eliminate repetitions.
3.3 Total real sales of each seller?
To fulfill this request you will need to add the contents of the column ValorVenda
for each seller. That is, it will be necessary to first separate the sales per seller (generating a subset with the lines of each seller) and then carry out the sum of each subset.
This is a typical application of the bundling clause GROUP BY
.
To group the lines per seller we use
GROUP BY NomeVendedor
And to add sales, we use the SUM aggregation function
Sum(ValorVenda)
Code
-- código #6
SELECT NomeVendedor, SUM(ValorVenda)
from Vendas
group by NomeVendedor;
Antônio 16,00
João 46,70
José 5,00
Maria 12,00
This is the first code of this article with the clause GROUP BY
. Analysing the implementation plan of código #6
, something that was not included in the previous implementation plans, which is the operator Stream Aggregate
.
Sqlfiddle
As documented by the operator Stream aggregate
, groups rows through one or more columns and then calculates one or more aggregation expressions returned by the query.
The operator Stream Aggregate
requires data entry sorted by columns within their groups. To ensure this condition, the query optimizer adds an operator Sort
before this operator (if the data are not yet classified). This can be observed in the above execution plan, as the table Vendas
is the type heap and without any index.
3.4 How many items of each product were sold per seller?
To fulfill this request you will need to create subsets per seller and within each of these subsets, create subsets per product. This is possible because the clause GROUP BY
allows the definition of more than one column.
To group the lines per seller we use
GROUP BY Nomevendedor
To group each product within each subset, we add the column that identifies the product
GROUP BY NomeVendedor, ProdutoVendido
And to add up the quantity of items sold we use again the aggregation function SUM
Sum(QuantidadeVendida)
Code:
-- código #7
SELECT NomeVendedor, ProdutoVendido, sum (QuantidadeVendida)
from Vendas
group by NomeVendedor, ProdutoVendido;
Antônio Beterraba 4
Maria Beterraba 3
José Cenoura 5
João Macarrão 21
João Molho de Tomate 1
3.5 How many different products were sold by each seller?
To count how many different products were sold by each seller the aggregation function COUNT
is the ideal
-- código #8
SELECT NomeVendedor, count (ProdutoVendido)
from Vendas
group by NomeVendedor;
Antonio 1
João 3
José 1
Maria 1
When consulting the result, and compare with the contents of the table Vendas
, we realized that for the Seller João 3 products were accounted for when he only sold two types of products: Noodle and Tomato sauce. But he made two noodle sales. How to do for that aggregation function COUNT
only add once each product? The answer is in the use of DISTINCT
within the function parameter, as per
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
-- código #8a
SELECT NomeVendedor, count (distinct ProdutoVendido)
from Vendas
group by NomeVendedor;
Antonio 1
João 2
José 1
Maria 1
Now yes the result came right, with the seller João
accounting for 2 different products.
And how were the execution plans of codes #8 and #8a?
Sqlfiddle
The presence of the operator can be seen in both Stream Aggregate
due to the clause GROUP BY
. And in the second consultation, concerning código #8a
the presence of the logical operator DISTINCT SORT
, before performing the grouping. This logical operator was included by the query optimizer to process DISTINCT ProdutoVendido
.
4. GROUP BY in place of DISTINCT
When the clause GROUP BY
is used without an aggregation function in the SELECT clause, it has a similar effect to DISTINCT
.
For example, code #3a can be rewritten, replacing DISTINCT with GROUP BY:
-- código #3a
SELECT DISTINCT NomeVendedor
from Vendas;
-- código #9
SELECT NomeVendedor
from Vendas
group by NomeVendedor;
The return of código #9
is
Antônio
João
José
Maria
The implementation plan is as follows::
That is, the same result and the same execution plan were generated for the two queries.
5. Partial considerations
The question of this topic served as the basis for writing article on the subject. In the references item at the end there is the link to the full article.
In this text the table Vendas
is the type heap and without indexes nonclustered. This type was chosen to demonstrate the conceptual functioning of GROUP BY
and DISTINCT
because the presence of indexes may change the execution plan generated.
DISTINCT
and GROUP BY
do not do the same thing, have different goals and usually generate different execution plans.
There are exceptions.
6. References
6.1 Documentation
6.2 Full Article
But the
group by
allows to know how many were grouped, or the largest of each grouped block– Isac
@Isac and on performance issues? They are also used in the same way!
– Marconi
The goal of GROUP BY is to group lines that have a similar subset and perform aggregation functions on these lines. The goal of DISTINCT is to return a subset of lines without repetitions. // Using GROUP BY without aggregation function generates the same result as using DISTINCT.
– José Diz
@Doesn’t Josédiz want to make an answer? I think this is a normal question.
– Marconi