DISTINCT and GROUP BY, what is the difference between the two statements?

50

DISTINCT

The SELECT DISTINCT statement is used to return only distinct (different) values.

Within a table, a column often contains many duplicate values, and sometimes you only want to list the distinct ones.

Syntax

SELECT DISTINCT column1, column2, ...
FROM table_name;

GROUP BY

The GROUP BY clause is usually used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result set by one or more columns.

Syntax

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);

DISTINCT returns different values, and grouped values (GROUP BY) are also distinct.

To demonstrate that both statements return the same results, I made a test:

SQL Fiddle

Some questions:

What is the difference between the two statements, besides the syntax? Are there differences in performance? Is there any situation (a practical example) where both statements are required? Could you give examples of the best use of each one?

It may sound silly, but these statements confuse people because, although they do not look alike, both can bring the same results.

  • But GROUP BY lets you know how many rows were grouped, or the largest value in each group.

  • @Isac and what about performance? They are also used in the same way!

  • The goal of GROUP BY is to group rows that share the same values in a subset of columns and to apply aggregation functions to those rows. The goal of DISTINCT is to return a set of rows without repetitions. // Using GROUP BY without an aggregation function generates the same result as using DISTINCT.

  • 1

    @Josédiz, don't you want to post an answer? I think this is a good question.

5 answers

45


1. Introduction

The goal is to clarify the differences between the two statements and where each one applies. The reference database manager is SQL Server.


2. What appears in the documentation?

2.1 DISTINCT

SELECT [ ALL | DISTINCT ]  
[ TOP ( expression ) [ PERCENT ] [ WITH TIES ] ]   
<select_list>  

ALL
Specifies that duplicate rows can appear in the result set. 
ALL is the default.
DISTINCT
Specifies that only unique rows can appear in the result set. 
Null values are considered equal for the purposes
of the DISTINCT keyword.

The pair of brackets in [ ALL | DISTINCT ] indicates that the two arguments are optional and, if given, mutually exclusive: either one or the other. As stated in the documentation, ALL is the default; that is, if neither argument appears in the command, ALL is assumed. For the DISTINCT argument, the documentation speaks of unique rows, which should be understood as rows whose values are not repeated.
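A minimal sketch of the two points above, assuming a hypothetical table Exemplo that is not part of the original answer: omitting the keyword behaves like ALL, and DISTINCT treats null values as equal to each other.

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE Exemplo (Valor integer NULL);

INSERT INTO Exemplo VALUES (1), (NULL), (1), (NULL);

-- ALL is the default, so these two queries are equivalent: 4 rows each.
SELECT Valor FROM Exemplo;
SELECT ALL Valor FROM Exemplo;

-- Nulls count as equal for DISTINCT: 2 rows, one NULL and one 1.
SELECT DISTINCT Valor FROM Exemplo;
```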

The definition of <select_list> is extensive but, for the purpose of this article, only the presence of column names will be considered:

SELECT coluna_1, coluna_2, ..., coluna_n
  from tabela;

2.2 GROUP BY

GROUP BY {
      <column-expression>  
    | ROLLUP ( <group_by_expression> [ ,...n ] )  
    | CUBE ( <group_by_expression> [ ,...n ] )  
    | GROUPING SETS ( <grouping_set> [ ,...n ]  )  
    | () 
} [ ,...n ]

Groups a selected set of rows into a set of summary rows by the values of one or more columns or expressions. One row is returned for each group. The aggregate functions in the <select> list of the SELECT clause provide information about each group instead of individual rows.

The purpose of this clause is to group rows that have the same values in the columns named in the clause, generating subsets. For each subset, aggregation functions can then be applied to the remaining columns. In the end, a single row is returned for each subset, containing the grouping columns and the results of the aggregation functions. For the definition of <column-expression> we will consider column names only.

2.3 Aggregation functions

The GROUP BY documentation mentions aggregation functions. The documentation on aggregation functions says:

Aggregation functions perform a calculation on a set of values and return a single value. Aggregation functions are usually used with the GROUP BY clause of the SELECT statement.

In a simplified model, the aggregation functions listed in the SELECT clause are executed for each subset generated by the GROUP BY clause.

Examples of aggregation functions:

COUNT: Returns the number of items in a group.
AVG: Returns the average of the values in a group.
SUM: Returns the sum of a numeric expression evaluated over a specified set.

3. Demonstration of the use of resources

To demonstrate the application of DISTINCT and GROUP BY, we will use the following table:

-- code #1
CREATE TABLE Vendas (
  NomeVendedor varchar(30),
  ProdutoVendido varchar(50),
  QuantidadeVendida integer,
  ValorVenda money
);

-- code #2
INSERT into Vendas values
    ('João', 'Macarrão', 18, 35.00),
    ('Maria', 'Beterraba', 3, 12.00),
    ('José', 'Cenoura', 5, 5.00),
    ('João', 'Molho de tomate', 1, 7.50),
    ('Antônio', 'Beterraba', 4, 16.00),
    ('João', 'Macarrão', 3, 4.20);

Suppose you need to generate the following reports:

  • Who are the sellers?
  • Which products did each seller sell?
  • What is the total sales value, in reais, of each seller?
  • How many items of each product were sold per seller?
  • How many different products were sold by each seller?

3.1 What are the sellers?

-- code #3
SELECT NomeVendedor 
  from Vendas;

João
Maria
José
João
Antônio
João

However, notice that the name João appears 3 times. How can the repetitions be eliminated? This is a typical application of DISTINCT!

-- code #3a
SELECT DISTINCT NomeVendedor 
  from Vendas;

Antônio
João
José
Maria

Comparing the execution plans of code #3 and code #3a, the difference is easy to see: the logical operator Distinct Sort appears in the execution plan of code #3a.

[images: execution plans of code #3 and code #3a]

Sqlfiddle

3.2 Which products did each seller sell?

This request involves two columns: NomeVendedor and ProdutoVendido.

-- code #5
SELECT NomeVendedor, ProdutoVendido 
  from Vendas;

The result is that the pair {João, Macarrão} appears more than once. Here is another typical application of DISTINCT, but now acting on two columns.

-- code #5a
SELECT DISTINCT NomeVendedor, ProdutoVendido 
  from Vendas;

Important: DISTINCT acts on the columns NomeVendedor and ProdutoVendido together. It considers the combination of the two columns when eliminating repetitions.
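To make the combined effect concrete, the sketch below counts the rows each variant returns. It assumes the table Vendas created and populated by code #1 and #2; the derived-table aliases are only illustrative.

```sql
-- One column: 4 distinct sellers.
SELECT COUNT(*) AS Total
  FROM (SELECT DISTINCT NomeVendedor FROM Vendas) AS Vendedores;

-- Two columns: 5 distinct (seller, product) pairs, because
-- {João, Macarrão} and {João, Molho de tomate} are different pairs.
SELECT COUNT(*) AS Total
  FROM (SELECT DISTINCT NomeVendedor, ProdutoVendido FROM Vendas) AS Pares;
```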

3.3 Total sales, in reais, of each seller?

To fulfill this request you need to add up the contents of the column ValorVenda for each seller. That is, it is necessary to first separate the sales per seller (generating a subset with the rows of each seller) and then compute the sum of each subset.

This is a typical application of the grouping clause GROUP BY.

To group the rows per seller we use

GROUP BY NomeVendedor

And to add up the sales, we use the SUM aggregation function

SUM(ValorVenda)

Code

-- code #6
SELECT NomeVendedor, SUM(ValorVenda) 
  from Vendas
  group by NomeVendedor;

Antônio 16,00
João 46,70
José 5,00
Maria 12,00

This is the first code of this article with the GROUP BY clause. Analysing the execution plan of code #6, we find something that did not appear in the previous execution plans: the Stream Aggregate operator.

[image: execution plan of code #6]

Sqlfiddle

According to the documentation, the Stream Aggregate operator groups rows by one or more columns and then calculates one or more aggregate expressions returned by the query.

The Stream Aggregate operator requires its input to be ordered by the grouping columns. To ensure this condition, the query optimizer adds a Sort operator before it (if the data is not already sorted). This can be observed in the execution plan above, as the table Vendas is a heap without any index.
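As an illustration of the last point, and not part of the original tests: if the table had an index that already delivers the rows ordered by NomeVendedor, the optimizer could feed Stream Aggregate directly and would typically have no need for the Sort operator. The index name below is an assumption.

```sql
-- Hypothetical index; once it exists the table is no longer a heap.
CREATE CLUSTERED INDEX IX_Vendas_NomeVendedor
    ON Vendas (NomeVendedor);

-- Same query as code #6; compare its execution plan before and after the index.
SELECT NomeVendedor, SUM(ValorVenda) AS TotalVendas
  from Vendas
  group by NomeVendedor;
```

The plan actually chosen still depends on the SQL Server version and on the data, so it is worth checking rather than assuming.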

3.4 How many items of each product were sold per seller?

To fulfill this request you need to create subsets per seller and, within each of these subsets, create subsets per product. This is possible because the GROUP BY clause accepts more than one column. To group the rows per seller we use

GROUP BY NomeVendedor

To group each product within each subset, we add the column that identifies the product

GROUP BY NomeVendedor, ProdutoVendido

And to add up the quantity of items sold we again use the aggregation function SUM

   SUM(QuantidadeVendida)

Code:

-- code #7
SELECT NomeVendedor, ProdutoVendido, sum (QuantidadeVendida)
  from Vendas
  group by NomeVendedor, ProdutoVendido;

Antônio Beterraba 4
Maria Beterraba 3
José Cenoura 5
João Macarrão 21
João Molho de tomate 1

3.5 How many different products were sold by each seller?

To count how many different products were sold by each seller, the aggregation function COUNT is the natural choice

-- code #8
SELECT NomeVendedor, count (ProdutoVendido)
  from Vendas
  group by NomeVendedor;

Antônio 1
João 3
José 1
Maria 1

Checking the result against the contents of the table Vendas, we see that 3 products were counted for the seller João, although he sold only two kinds of product: Macarrão and Molho de tomate. He made two sales of Macarrão. How can we make the COUNT aggregation function count each product only once? The answer is to use DISTINCT inside the function argument, as per

COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )


-- code #8a
SELECT NomeVendedor, count (distinct ProdutoVendido)
  from Vendas
  group by NomeVendedor;

Antônio 1
João 2
José 1
Maria 1

Now the result is right, with the seller João accounting for 2 different products.

And what do the execution plans of code #8 and code #8a look like?

[images: execution plans of code #8 and code #8a]

Sqlfiddle

Both plans contain the Stream Aggregate operator, due to the GROUP BY clause. In the second query, code #8a, the logical operator Distinct Sort also appears before the grouping. The query optimizer included this logical operator to process DISTINCT ProdutoVendido.
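Both counts can also be returned side by side in a single query, which makes the effect of DISTINCT inside the function easy to see. A sketch, assuming the table Vendas from code #1 and #2; the column aliases are illustrative.

```sql
SELECT NomeVendedor,
       COUNT(ProdutoVendido)          AS Vendas_Registradas,
       COUNT(DISTINCT ProdutoVendido) AS Produtos_Diferentes
  from Vendas
  group by NomeVendedor
  order by NomeVendedor;

-- Rows for the sample data (their order may vary with the collation):
-- Antônio  1  1
-- João     3  2
-- José     1  1
-- Maria    1  1
```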


4. GROUP BY in place of DISTINCT

When the clause GROUP BY is used without an aggregation function in the SELECT clause, it has a similar effect to DISTINCT.

For example, code #3a can be rewritten, replacing DISTINCT with GROUP BY:

-- code #3a
SELECT DISTINCT NomeVendedor 
  from Vendas;


-- code #9
SELECT NomeVendedor
  from Vendas
  group by NomeVendedor;


The result of code #9 is

Antônio
João
José
Maria

The execution plan is as follows:

[image: execution plan of code #9]

That is, the same result and the same execution plan were generated for the two queries.
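The same substitution works with more than one column. As a sketch, assuming the table Vendas from code #1 and #2, code #5a can be rewritten with GROUP BY and should return the same five pairs; whether the execution plan is also identical was not verified here.

```sql
-- Equivalent in result to code #5a:
--   SELECT DISTINCT NomeVendedor, ProdutoVendido from Vendas;
SELECT NomeVendedor, ProdutoVendido
  from Vendas
  group by NomeVendedor, ProdutoVendido;
```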


5. Partial considerations

The question in this topic served as the basis for an article on the subject. The References item at the end has the link to the full article.

In this text the table Vendas is a heap, without nonclustered indexes. This type was chosen to demonstrate the conceptual functioning of GROUP BY and DISTINCT, because the presence of indexes may change the generated execution plan.

DISTINCT and GROUP BY do not do the same thing: they have different goals and usually generate different execution plans.

There are exceptions.


6. References

6.1 Documentation

6.2 Full Article


  • Well prepared and explanatory answer

  • @Ricardopunctual When I read your answer, I voted for it and added a comment to the question. But as the author requested something more detailed, I decided to write an article about it, which will soon be published. Part of the article has been transcribed to this topic.

  • It was really good. I read a lot but had not found anything this complete in Portuguese

  • @Josédiz yours is more detailed; I have already marked it as accepted. Thank you

  • 1

    Practically the first part of a monograph (+1)

  • Added item 6.2, with link to the full article.

  • Excellent explanation, helped me understand


25

They have very different purposes. While DISTINCT aims to return unique information by removing duplicates, GROUP BY groups the values and is used together with aggregation functions such as COUNT and SUM.

Your example did not make good use of GROUP BY; note that it does not use aggregation functions:

Select Nome, Sexo from Pessoa
Group by Nome, Sexo

Try, for example, to count how many records you have for each sex. The query below makes it very simple:

select Sexo,Count(Sexo) from Pessoa
Group by Sexo

Now try to do this with DISTINCT alone. The answer is: it is not possible. That's the difference between them.

In the case of your example, as the goal is only to get the non-duplicated values, DISTINCT is the best option.
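A minimal sketch of the contrast described above, assuming a table Pessoa with the columns used in the queries and some made-up rows (the real data from the question is not reproduced here):

```sql
-- Hypothetical data, for illustration only.
CREATE TABLE Pessoa (Nome varchar(30), Sexo char(1));

INSERT INTO Pessoa VALUES
    ('Ana', 'F'),
    ('Bruno', 'M'),
    ('Carla', 'F');

-- DISTINCT lists which values exist: F and M.
SELECT DISTINCT Sexo FROM Pessoa;

-- GROUP BY with COUNT also says how many rows each value has: F 2, M 1.
SELECT Sexo, COUNT(Sexo) AS Total
  FROM Pessoa
 GROUP BY Sexo;
```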

As for performance, you need a good volume of data to benchmark and see the difference, but I believe that, given its purpose, GROUP BY is not the most recommended choice for your example.

EDIT: searching, I found the following link in English stating that both generate the same execution plan, but it is rather old, and something may have changed in more recent versions: https://stackoverflow.com/

  • In addition, GROUP BY gives access to all the collapsed records: you can use aggregation functions such as group_concat, for example, to concatenate all the values of a given field that were collapsed in the grouping.

  • 1

    @Leonancarvalho yes, group_concat helps; just remember that it is a MySQL function. I don't recall it existing in SQL Server or in Oracle

  • True, in Oracle the similar function is LISTAGG (11g+), which uses another grouping scheme. I think my example was not the most common one; perhaps SUM and COUNT are the most universal.

  • 1

    @Leonancarvalho actually, GROUP_CONCAT also does not work in MSSQL.

  • @Ricardopunctual thanks for the answer, I have already left my +1.

  • 2

    Just as complementary information: the string_agg() function is available as of SQL Server 2017.


19

They have semantic differences, even if they produce equivalent results on your specific data.

GROUP BY allows you to use aggregate functions such as AVG, MAX, MIN, SUM, and COUNT. DISTINCT, on the other hand, simply removes duplicates.

For example, if you have a lot of purchase records and want to know how much you spent in each department, you can do something like:

SELECT departamento, SUM(valor) FROM compras GROUP BY departamento

This will give you one row per department, containing the name of the department and the sum of valor over all the rows for that department.
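A small sketch of that result, assuming a compras table with only the two columns used in the query and invented rows:

```sql
-- Hypothetical table and data.
CREATE TABLE compras (departamento varchar(30), valor decimal(10,2));

INSERT INTO compras VALUES
    ('Livros', 40.00),
    ('Jogos', 120.00),
    ('Livros', 15.50);

-- One row per department with the summed valor: Jogos 120.00, Livros 55.50.
SELECT departamento, SUM(valor) AS total
  FROM compras
 GROUP BY departamento;

-- DISTINCT only lists the departments, with no totals: Jogos, Livros.
SELECT DISTINCT departamento FROM compras;
```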

In SQL Server there is usually no difference in performance, as both result in the same execution plan.

"A DISTINCT and GROUP BY usually generate the same query plan, so the performance should be the same in both query constructs".

Source: SQL Server Group by vs Distinct, Distinct vs Group By

In Oracle the situation is the same: there is no difference.

Source: DISTINCT vs. GROUP BY

Searching further, I found a site showing that, depending on the situation, GROUP BY performs better than DISTINCT; it depends a lot on the subqueries.

As the explanation is long, check the link:

Performance surprises and premises: GROUP BY vs. DISTINCT

  • 1

    Thanks for the reply, I’ve already left my +1.

13

A little shorter answer:

  • DISTINCT is used to keep only the unique records among those that meet the query criteria.

    SELECT DISTINCT cliente FROM Pedidos;
    
  • GROUP BY is used to group the data on which the aggregation functions (GROUP_CONCAT, COUNT, SUM, for example) are applied, and the output is returned based on the columns of the GROUP BY clause.

    SELECT
      cliente,
      count(*) as 'Total Pedidos',
      SUM(total_pedido) as 'Total em compras'
    
    FROM Pedidos GROUP BY cliente;
    

It is difficult to say which is faster. If you only want the unique values from a given table, DISTINCT will tend to perform better, but that is not an absolute rule either: depending on the indexes and on how the DBMS implements the query, performance may vary.
Most of the time, GROUP BY will be more efficient in operations involving columns other than the one being grouped or made distinct, and will perform no differently from DISTINCT in single-column operations, as shown in the other answers.
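One way to check this on your own data, sketched under the assumption of a DBMS that supports the EXPLAIN statement (MySQL and PostgreSQL do; SQL Server exposes execution plans through other means, such as the graphical plan in Management Studio):

```sql
-- Show the plan the DBMS chose for each form, then compare them.
EXPLAIN SELECT DISTINCT cliente FROM Pedidos;

EXPLAIN SELECT cliente FROM Pedidos GROUP BY cliente;
```

If the two plans differ, timing both queries on a realistic volume of data is the next step.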

  • Leonan, already left my +1, your reply is excellent. Thank you very much.

0

As far as I know (at least following database fundamentals and not each vendor's implementation), the DISTINCT and GROUP BY algorithms are identical in their initial stages: first a sort and then the elimination of repetitions. The difference between the algorithms is that GROUP BY allows aggregation functions as an additional step, while DISTINCT does not. When using GROUP BY without these functions, we are "repeating" exactly the initial phases that DISTINCT uses.

  • 3

    Were you the one who wrote the answer on MSDN?

  • 3

    When copying a full text, it may be worth quoting and referencing the author (a link at least) and, who knows, adding some personal considerations.
