How index selectivity works

Asked

Viewed 722 times

13

Reading about indexes I recently came across the term "selectivity". Some places said more selective fields should come first on the index, others say no or that it depends.

I have the following table tbMovimentoConta abbreviation (mcon) with 5 million records, and the following columns that will be used in where

  • mcon_dataExclusion (99% of the records have the value null) because data cannot be deleted with delete from
  • con_codigo_fk (there are 5 distinct values(1,2,3,4,5) divided for the 5 million records)
  • mcon_data (1825 distinct dates for the 5 million records)

I have the following query that is generated by the Entity Framework:

exec sp_executesql N'SELECT 
[Project1].[C2] AS [C1], 
[Project1].[mcon_codigo_pk] AS [mcon_codigo_pk], 
[Project1].[mcon_data] AS [mcon_data], 
[Project1].[mcon_obs] AS [mcon_obs], 
[Project1].[mcon_valor] AS [mcon_valor], 
[Project1].[mcon_tipo] AS [mcon_tipo]
FROM ( SELECT 
    convert (datetime2, convert(varchar(255), [Extent1].[mcon_data], 102) ,  102) AS [C1], 
    [Extent1].[mcon_codigo_pk] AS [mcon_codigo_pk], 
    [Extent1].[mcon_data] AS [mcon_data], 
    [Extent1].[mcon_valor] AS [mcon_valor], 
    [Extent1].[mcon_tipo] AS [mcon_tipo], 
    [Extent1].[mcon_obs] AS [mcon_obs], 
    1 AS [C2]
    FROM [dbo].[tbMovimentoConta] AS [Extent1]
    WHERE ([Extent1].[mcon_data] >= @p__linq__0) AND ([Extent1].[mcon_data] <= @p__linq__1) AND ([Extent1].[con_codigo_fk] = @p__linq__2) AND ([Extent1].[mcon_dataExclusao] IS NULL)
)  AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[mcon_codigo_pk] ASC
OFFSET 764225 ROWS FETCH NEXT 11 ROWS ONLY ',N'@p__linq__0 datetime2(7),@p__linq__1 datetime2(7),@p__linq__2 bigint',@p__linq__0='2012-01-01 00:00:00',@p__linq__1='2013-01-01 23:59:59',@p__linq__2=1

I tested with two index settings

Index 1

CREATE NONCLUSTERED INDEX [ix_consulta_movimento] ON [dbo].[tbMovimentoConta]
(
    [mcon_dataExclusao] ASC,
    [con_codigo_fk] ASC,
    [mcon_data] ASC
)
INCLUDE (   
[mcon_dataInclusao],
[mcon_codigo_pk],
[mcon_valor],
[mcon_tipo],
[mcon_obs]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Index 2

CREATE NONCLUSTERED INDEX [ix_consulta_movimento] ON [dbo].[tbMovimentoConta]
(
    [mcon_data] ASC,
    [con_codigo_fk] ASC,
    [mcon_dataExclusao] ASC
)
INCLUDE (   
[mcon_dataInclusao],
[mcon_codigo_pk],
[mcon_valor],
[mcon_tipo],
[mcon_obs]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

In the tests I did, the index 1 was faster, but I thought the 2 would be faster. Could someone explain to me why the index 1 is faster, or if there’s something wrong with it, so the 2 would have to be faster.

Editing

Doing more tests here, it seems that the index 2 is faster even. Anyway I’m doubt. I need to have a clear understanding of how to sort these columns in the index.

  • 2

    It depends on what you need. If you put the ID At first you’ll have great selectivity, but if you want a group of information between dates, it doesn’t help at all. If you do it blindly you will end up creating dozens of indexes, most without real need and may have performance loss elsewhere.

  • I’ll edit the question. I took the trace query that the Entity Framework is generating, and I’ll complete the missing index includes Columns.

  • 2

    Perhaps an analysis of the consultation implementation plan can help you understand. The plan includes JOIN and WHERE strategies that can provide better answers than just the question of choosing the indexes.

  • 1

    I did that, actually he who suggested the index 1, I created the index two.

  • 2

    Those interested in understanding more about indexes have 3 books that are FREE and good. http://www.red-gate.com/library/sql-server-execution-plans-2nd-edition , http://www.red-gate.com/community/books/inside-sql-server-query-optimizer e https://www.simple-talk.com/books/sql-books/complete-showplan-operators/

  • I put in this answer a command to check missing Indice, see if it helps, and also always analyze the query plan, remembering that every Indice created has an impact on Insert, update and delete. https://answall.com/questions/173358/howto improve a-performance-de-leitura-de-um-banco-dados-sql/173899#173899

  • Then shows the execution plan, only with it can be sure if the query ran more or less fast.

Show 2 more comments

2 answers

1

"index 1 was faster, but I thought 2 would be faster"

To answer this question, we need to remember that an index is a list ordered by one or more fields with a pointer pointing to the corresponding record in the table.

As it is ordered, the search is much faster, avoiding the "table scan" which happens when we search for an unordered field, that is to say a complete query in the table.

The answer to your question is very simple, especially with an important detail: "99% of records have null value", i.e., null records will all come in sequence in the index, which means that when searching for a value, it will limit the search to only 1% of the index, making it very fast.

To try to illustrate, imagine that the index was something like this:

inserir a descrição da imagem aqui

That is, any date will be in that "small" part of the index, the whole "null" part will be skipped, resulting in a much faster search, because the field mcon_dataExclusao is soon the first of the index, greatly limiting the search.

0

The indexes serve to facilitate when selecting the data that will be returned by the query. For composite indexes, that is, those that present more than one column in their structure, the order of them serve to improve the SCAN process in the tables that the DBMS does at the time an operation is performed on it. There are some criteria and guidelines for ordering them following the concept of selectivity:

  • Columns that appear in the WHERE clause should occupy an initial top position;
  • Columns that appear most frequently in the WHERE clause, order these columns in the index so that the selected ones most often appear in a higher position;
  • If the columns appear at the same frequency, but the data is physically ordered in one of the columns, place it in a higher position;
  • If the column is very sparse, that is, it has many NULL values in relation to the others, it must be ordered in a lower position in relation to the others.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.