How does an index improve the performance of a query?


Why does creating an index improve the performance of a query? Is any change made to the table structure? Is the data structure for disk storage modified? Is the algorithm used to perform the search different? What enables performance improvement?


2 answers


It's Magic

It is pure mathematics!

I won’t go into every detail (really :) ); specific doubts can be raised as new questions.

Binary search

A database index is not at all different from a book’s index, or even from a dictionary, which is itself an index.

How do you find the word you want among many thousands of words? You know the words are in alphabetical order.

Using the first letter, you already open the book at a likely page: if the word starts with "C" you open it more or less at the beginning, with "N" more or less in the middle, and with "S" near the end.

From there you keep looking at the first word of whatever page you land on and checking whether what you’re looking for is further ahead or further back from where you are. Do this successively until you find a page that is very close to the word you seek.

Then you look for the word inside the page. You don’t have to check entry by entry; you can apply the same kind of search you were doing across the pages: look at the most likely place the word would be on the page and move up or down depending on whether you got closer to or farther from the word you want.

This type of search is a "divide and conquer" algorithm. If you can eliminate part of what you are searching through, it becomes easier to find. And if the data is in some order, it is easy to eliminate a part, knowing that what you seek is either further ahead or further back.

Well, in computing we do exactly that, only in a well-structured and precise way. When we have a collection of data in order, we start searching right in the middle of it (it may also be worth looking at the first and perhaps the last element of the collection).

Looking at the middle element, you can tell whether what you’re looking for is bigger or smaller than what you found there (unless what’s right in the middle is the thing itself). It’s a simple comparison that everyone does all the time, no matter whether the data is a number, a date, a text, etc. You can use < or > and that’s it. If what you’re looking for is bigger, you discard the first half of the collection and now search only the second half.

But you won’t look at everything in that second half either. You look only at its middle, i.e. at the 3/4 position of the collection. If what’s there is equal, great, you’ve found what you want. If it is smaller, you discard that final part and look only at the quarter of the collection between the 1/2 and 3/4 positions.

And then you do it again, keeping 1/8 of the collection. While you haven’t found it, you keep repeating with 1/16, 1/32, 1/64, 1/128.

This is called binary search. You can find an item among 2 billion elements of a data collection in at most 32 attempts (2^31 is about 2 billion; the one extra attempt is the initial probe). You may find it in fewer operations; you may even find it on the first try. As a curiosity, if you have 10 million elements in the collection it takes at most 24 attempts. For 4000 it takes up to 13 attempts. 1 trillion takes 41 tries. Fast, isn’t it?

This is what’s called algorithmic complexity O(log n): the time spent grows with the logarithm of N, N being the number of elements in the collection.
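To make the idea concrete, here is a minimal sketch of binary search in Python (purely illustrative; the names are mine, not from any database engine):

    def binary_search(items, target):
        """Search a sorted list; return the index of target, or None.

        Each probe discards half of the remaining range, so at most
        about log2(len(items)) + 1 probes are ever needed.
        """
        low, high = 0, len(items) - 1
        while low <= high:
            mid = (low + high) // 2
            if items[mid] == target:
                return mid
            if items[mid] < target:
                low = mid + 1   # discard the first half
            else:
                high = mid - 1  # discard the second half
        return None

    # 2 billion elements would need at most 32 probes;
    # these 8 elements need at most 4.
    print(binary_search([2, 3, 5, 7, 11, 13, 17, 19], 13))  # -> 5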

Binary search has a downside: the collection needs to be kept in order at all times. So we gained a lot on the search, but it would be a tragedy to insert, modify or remove data while keeping the collection in order. You would have to rebuild the entire collection on each such operation. Completely unviable.

Binary tree

So we need to organize the collection in a way that doesn’t require rebuilding it on every modification, or that at least minimizes the reconstruction. Once again we fall into a "divide and conquer" situation.

So we need a data structure that divides the collection into several parts, so that every time I modify an element I only have to touch the part it belongs to. If I divide it in 2, I only have to rebuild half of the collection; dividing in 4, only 1/4; dividing further, 1/8, 1/16, 1/32 of the collection. I keep doing this until each part is small and easy to modify. I can get to the point where each part holds a single element.

This is called a binary tree. It is a data structure that starts at a root and splits into pairs of branches, level after level of branches, until it reaches a leaf, which ends the tree.

On insertion, the first key is placed at the root. When a new key arrives, it is compared to see whether it is larger than or equal to the key at the root; if it is larger, it is linked as a node on one side. If it is smaller, it goes on the other side.

Ideally the root would hold a value right in the middle of all possible data. But just as with the dictionary that we "never" open at the right page, the content doesn’t have to start at the best point.

When more data comes in, it is compared with the root and then with one of the branches: if it is larger than the root it goes to one side and the other side is ignored. Then this new key goes to one side or the other of that branch. And so it repeats. It ends up forming a sort of pyramid (although it can get lopsided on one side).

A tree looks something like this:

[image: a binary tree]

But the most common is to come all disorganized and the tree doesn’t look so perfect:

[image: an unbalanced tree]

So if you have 32 levels of branches you can hold 4 billion keys (2^32). The worst case to reach any key is to walk down the 32 branches. It’s very fast, and there’s the magic.

Structured this way, insertion into the data collection can be almost as fast as the search.

The tree is like a linked list with the advantage of divide and conquer. This way you don’t have to rebuild the whole collection.

And you get another advantage: if you want to "list" the entire collection in the order of the key (the information used to sort the data records), just traverse the tree node by node and it comes out in order.
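Here is a rough sketch of both ideas — insertion by comparison and the in-order listing — assuming a deliberately simple Node class (real engines use far more elaborate structures):

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Walk down comparing at each branch; larger-or-equal goes right."""
        if root is None:
            return Node(key)
        if key >= root.key:
            root.right = insert(root.right, key)
        else:
            root.left = insert(root.left, key)
        return root

    def in_order(root):
        """Visiting left subtree, node, right subtree yields keys in order."""
        if root is not None:
            yield from in_order(root.left)
            yield root.key
            yield from in_order(root.right)

    root = None
    for key in [50, 30, 70, 20, 40, 60, 80]:
        root = insert(root, key)
    print(list(in_order(root)))  # -> [20, 30, 40, 50, 60, 70, 80]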

There is an immense number of known algorithms that implement binary trees.

Each has its advantages and disadvantages. Some are very naive, others are very difficult to implement. The plain binary tree is the most basic and is virtually discarded by databases.

I have answered about B-trees before. That type of tree (in some variation) is much more used in indexes than the purely binary tree: it is easier to deal with multiple branches and leaves than to keep everything strictly divided in 2.

Balanced tree

It’s not all roses. There are cases where the tree leans to one side, so you can have those 4 billion keys with 2 or 3 levels on one side and tens or even hundreds of levels on the other. It still stays very fast, but in an extreme case you could have 4 billion levels and the search becomes sequential.

And it’s not that hard to get such an imbalance: if all the data arrives already in order, that’s exactly what happens. An ideal index should have as few levels as possible. To achieve this, balancing is necessary, and that is a more complex algorithm.

So the insertion of a new datum cannot be so simple. Ideally you need to rebalance the branches to try to keep all levels aligned. This can be done on each insertion or in blocks, and you can bound the extent of the rebalancing, since rebalancing one branch can unbalance another one higher up. Of course, it won’t rebalance everything on each insertion, because that would potentially rebuild the entire collection. You can live with some imbalance for a while, so you don’t have to tidy up all the time. And the data may arrive in a pattern that doesn’t unbalance things much; the ideal is that each branch always receives one bigger (or equal) key and one smaller key in turn.
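As an illustration of the kind of primitive a balancing algorithm uses, here is a single "left rotation" in Python, reusing the same toy Node class as in the sketch above; AVL and red-black trees are built on steps like this, though their bookkeeping is far more involved:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def rotate_left(root):
        """Lift the right child to be the new subtree root.

        The pivot's left subtree (keys between root and pivot)
        is handed over to the old root, preserving key order
        while reducing the depth of the right side by one level.
        """
        pivot = root.right
        root.right = pivot.left
        pivot.left = root
        return pivot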

Wikipedia article.

Nodes and pages

Well, I mentioned nodes, which is where each element lives. In practice it is more advantageous to organize these nodes into pages holding several of them. This grouping is useful because both main memory (RAM) and mass storage (HDD/SSD) work with pages, so since you are going to access the page anyway, it pays to have a lot of information there together.

Handling things node by node would be very complicated, would end up slow and would take up too much space.

Pages can’t be too small (in general it doesn’t pay to go below 4 KB, because that is the page size used by the operating system) nor too large, because these pages need to be copied whole in some situations, need to be allocated in the cache, and need to be locked when there is concurrent access.

Node composition

Think about it: a dictionary is an index. It has one order. And if you wanted to see all the dictionary entries in another order (I know it’s rare to need this), what would you do?

In a dictionary, the main data and the index are one and the same thing. But it doesn’t have to be that way. You can have the main data, which is still an index and is often called the primary index (and is always a covering index - see below - that is, the table itself), usually laid out sequentially by a simple key controlled by the database (the ID), and then have several secondary indexes.

In these secondary indexes you have only a pair: a key (the data that must be kept in order) and a pointer to where the record is (in the primary index). In general the key is larger than the pointer. The pointer is usually the ID.

Some databases have fixed-size records whose record numbers cannot change, so the primary index can be greatly simplified. But that is a detail.
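A toy sketch of this arrangement (the table and the names are invented for illustration): the secondary index holds only (key, pointer) pairs, and the pointer leads back to the primary index:

    # Primary "index": the table itself, keyed by ID.
    table = {
        1: {"id": 1, "name": "Ana"},
        2: {"id": 2, "name": "Bruno"},
        3: {"id": 3, "name": "Carla"},
    }

    # Secondary index: (key, pointer) pairs kept sorted by key;
    # the pointer is just the record's ID in the primary index.
    name_index = [("Ana", 1), ("Bruno", 2), ("Carla", 3)]

    def find_by_name(name):
        for key, pointer in name_index:  # a real index would binary-search here
            if key == name:
                return table[pointer]    # one extra hop to the primary index
        return None

    print(find_by_name("Bruno"))  # -> {'id': 2, 'name': 'Bruno'}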

Storage and cache

These pages can live in RAM or on "disk". Frequently accessed data usually stays longer in RAM, and that is the case with indexes: they are accessed all the time and need to be reached as fast as possible. There are a number of strategies to determine what "fits best" in the cache, but that’s beside the point.

These pages can either always stay in the same place (which requires locking for concurrent access) or be copied every time there is a change to them (copy on write). The latter is a technique called MVCC: it dispenses with locking, but needs garbage collection (of pages that are no longer necessary).

The order of these pages in memory or on disk doesn’t matter; there just needs to be an easy way to find them. And guess what is used to find the pages quickly? A binary tree. Of course it depends on the database, but basically there is a tree controlling where that information is.

Page access is very random, so an SSD helps database performance absurdly compared to an HDD. SSD access is effectively random, like RAM. HDD access is random-sequential; after all, it is a disk that keeps spinning sequentially. I imagine you can see that on tape a database is unviable, because access is completely sequential.

Think of each database page as a book page. The index tells you the first key of each page. At this point you don’t need to find the key itself, only which page it is on. So the key has to be greater than (or equal to) the first key of that page and smaller than the first key of the next page, considering they are properly sorted.

First you find out which page it is on, then you search inside the page. Inside the page, generally you don’t even need a tree.
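A minimal sketch of this two-step lookup in Python, using the standard bisect module for the page search (the page contents are invented):

    import bisect

    # First key of each page, in order (like the headwords of a dictionary).
    page_first_keys = ["apple", "grape", "mango", "peach"]
    pages = [
        ["apple", "banana", "cherry"],
        ["grape", "kiwi", "lemon"],
        ["mango", "melon", "orange"],
        ["peach", "pear", "plum"],
    ]

    def find(key):
        # Step 1: binary-search for the page whose first key is <= key.
        page_no = bisect.bisect_right(page_first_keys, key) - 1
        if page_no < 0:
            return None
        # Step 2: inside the page, a simple scan is usually enough.
        return key if key in pages[page_no] else None

    print(find("kiwi"))  # -> 'kiwi' (found on page 1)
    print(find("fig"))   # -> None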

Some data may be larger than a page, so there are special pages that are linked to that main page.

If you’re having trouble understanding these links, it’s because you don’t yet understand how linked lists work.

Other types of index

This is by far the index most used by databases. But other forms are used as well, mostly as a way to optimize certain data or search patterns.

Hash

There are scenarios where a hash index (so called because it uses a hash table) can be more interesting. It has O(1) complexity which, while nothing absurdly better, usually beats O(log n). That’s right: no matter the size, it finds what it wants as if the collection had a single element. For comparison purposes, array access is O(1). Of course, this O(1) is not the worst case, but the worst case never happens in practice, and even the "slightly worse" cases rarely happen. Ideally your keys should be unique.

It has a disadvantage: there is no order in it. If you need the data in some order, it cannot provide it; the order is always arbitrary. So it’s not for listings. And not just that: if the table needs resizing (rehashing), it can be quite time consuming and block access to the index for a long while.
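A Python dict is itself a hash table, so it makes a convenient toy model of a hash index (the rows and names here are invented):

    # Build a hash index over a list of rows: key -> row position.
    # Dict lookups are O(1) on average, regardless of the table size.
    rows = [
        {"id": 10, "email": "ana@example.com"},
        {"id": 25, "email": "bruno@example.com"},
        {"id": 37, "email": "carla@example.com"},
    ]
    email_index = {row["email"]: pos for pos, row in enumerate(rows)}

    print(rows[email_index["bruno@example.com"]])  # one probe, no scan

    # The downside in action: iterating the index reflects hashing and
    # insertion, not alphabetical order, so it cannot serve an ordered
    # listing the way a tree index can.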

More information.

Inverted index

When the same data is repeated in several database records, it is common for an inverted index to be more interesting. Unlike the normal index, where you have a key and one pointer, this type of index has a key and multiple pointers: it indicates all the records of the database that contain that key.

It may also hold other information to help produce a better ranking.

There are some domains where this is quite interesting, but one that is quite common is textual search. Any application that searches text can in general have huge performance gains from its use - be it a company’s data search or a super giant mega hyper web search engine.

That’s usually called full text search.
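A minimal sketch of an inverted index over a few toy documents (real full-text engines add tokenization, stemming, ranking, and so on):

    from collections import defaultdict

    documents = {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "quick brown dogs bark",
    }

    # One key (word) -> many pointers (every document containing it).
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            inverted[word].add(doc_id)

    print(sorted(inverted["quick"]))                     # -> [1, 3]
    print(sorted(inverted["the"] & inverted["quick"]))   # both words -> [1]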

More information.

Covering index

The covering index is just an optimization: alongside the key, it already stores other information that will normally be needed when that index is used, so the main table doesn’t have to be consulted, saving accesses to mass storage, which is quite slow.

Bitmap indexes

A bitmap index is useful when you search for one particular condition: you just want all database records satisfying that specific condition (it can’t even have parameters) to be returned. Like any condition, it can only be true or false. Instead of keys and pointers, a sequence of bits is kept, each bit mapping to one record of the database in a given order. So if you want to know whether the 5482nd record meets the condition, just skip 685 bytes and check the second bit of the next byte.
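A sketch of exactly that arithmetic in Python (the bitmap here is fabricated so that only record 5482 matches):

    def record_matches(bitmap: bytes, record_no: int) -> bool:
        """Check the bit for the record_no-th record (1-based)."""
        bit_pos = record_no - 1                    # record 5482 -> bit 5481
        byte = bitmap[bit_pos // 8]                # 5481 // 8 = 685: skip 685 bytes
        return bool(byte & (1 << (bit_pos % 8)))   # then test bit 1 of that byte

    bitmap = bytearray(1024)
    bitmap[685] |= 0b00000010                      # set the bit for record 5482
    print(record_matches(bytes(bitmap), 5482))     # -> True
    print(record_matches(bytes(bitmap), 5481))     # -> False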

It is fast and occupies very little space, but it only works for fixed conditions.

There are cases where the database creates a bitmap in memory at the moment you run a query, to gather all the records it should return. That is an internal optimization.

Sparse index

The sparse index is more of an optimization.

There are cases where it is more interesting to point the index at a block of data rather than at the specific datum; the index can be shortened this way. It is useful when keys repeat a lot. The search among repeated keys tends to be sequential anyway, so you don’t keep pointers for all elements: you point to the first record of the block and from there do a sequential scan.
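A toy sketch of that, assuming sorted records and one index entry per block of 100 (the data is invented):

    import bisect

    # Records sorted by key, with many repeated keys.
    keys = ["ant"] * 400 + ["bee"] * 300 + ["cat"] * 300

    # Sparse index: one entry per block of 100 records, not per record.
    BLOCK = 100
    sparse = [(keys[i], i) for i in range(0, len(keys), BLOCK)]

    def find_first(key):
        block_keys = [k for k, _ in sparse]
        # Start one block early, in case the first occurrence
        # begins in the middle of the previous block...
        i = max(bisect.bisect_left(block_keys, key) - 1, 0)
        pos = sparse[i][1]
        # ...then scan sequentially from that block's first record.
        while pos < len(keys) and keys[pos] < key:
            pos += 1
        return pos if pos < len(keys) and keys[pos] == key else None

    print(find_first("bee"))  # -> 400 (first of the repeated "bee" records)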

Multilevel index

The multilevel index is an optimization where an index points to another index, which points to yet another index. This way you can segment the indexes, which can help fit more data in memory or split the indexing across different storage volumes.

More

There are indexes that differ only in allowing a condition on what should or should not be indexed. The structure and the algorithm don’t change, but some keys are not indexed: there is a filter.

There are a number of tricks that can be added to an index to give it more power, flexibility and performance, but they don’t change anything in the functioning of the index itself.

And there are still other types beyond these.

  • Great bigown! Want to launch an eBook with your answers? It’s one lesson after another haha :P

  • @Laerte I do want to write something, yes; I just need to find the time.

  • Will the name still be ddd? Haha, when you release it I’ll want to buy it :p, just don’t forget to include a chapter with the pearls of master programmer @Onosendei, with due credit haha

  • @Marcelobonifazio That would be the blog’s name; it can’t be DDD because that one already exists (a dead blog, but the name is taken), and DDD is not what people think it is :D


You can think of the index as a book’s table of contents: using the table of contents it is much faster to find something in the book than picking it up and searching page by page, isn’t it? :) That’s the idea of the index.

Suppose you have a table with 1000 records. Using the index, with the condition ...WHERE ID = 90 it doesn’t need to search the whole table: it takes the index and looks up this value. Without the index it would have to scan the whole table until it found it.
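A rough Python analogy of the difference (not how a real engine implements it; the table is invented):

    # 1000 records; find the one WHERE id == 90.
    records = [{"id": i, "value": f"row {i}"} for i in range(1, 1001)]

    # Without an index: scan row by row until found (here, 90 comparisons).
    def full_scan(target):
        for row in records:
            if row["id"] == target:
                return row

    # With an "index" (id -> position): a single lookup.
    index = {row["id"]: pos for pos, row in enumerate(records)}

    def indexed(target):
        return records[index[target]]

    print(full_scan(90) == indexed(90))  # -> True, but far less work indexed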
