How to create multiple entries in an index based on columns in the same row?

Asked

Viewed 1,569 times

12

I have never found a good way to index multiple columns of a row as index entries or simulate this feature in Mysql.

The problem arises when you have fields working as tags or a similar concept. E.g. columns with names tag1, tag2, tag3. To search lines with a tag specific quickly requires you to have 3 indexes and make 3 separate queries in the most basic and obvious way.

Is there any way to index these three fields as entries of a single index that allows you to perform only one search.

Trying to illustrate how it would look

ID tag1 tag2 tag3
-- ---- ---- ----
01 abc  xyz  bla
02 foo  bar  ble
03 xyz  bla  bar

Como o índice se comportaria:

abc -> 01
bar -> 02 03
bla -> 01 03
ble -> 02
foo -> 02
xyz -> 01 03

So if I do a search for "bla", I will have access to Ids "01" and "03".

Is there any other way to do this efficiently? Even if it changes the presented structure.

  • Is the goal to query the X tag and return the Ids with it, or a query that returns the entire "index"? And do you want a solution to that specific table structure, or are you open to review the structure?

  • @bfavaretto edited to answer your questions

3 answers

12


Some time ago (2013? 2012?) I developed a system very similar to the question. I had a few million objects and a dozen "tags", and each object could have 0 or more tags associated. I had to filter these objects based on searches for tag sets. Similar, no?

As the number of tags per object, in my case, was theoretically unlimited (since new tags could be added to the system after release), the solution proposed in the question did not suit me, i.e., I could not use a table with columns "tag1", "tag2"etc. Also, this scheme does not allow use of indexes (see my first comment in the @mgibsonbr reply).

As I needed a lot of performance (i.e., queries answered in "seconds"), at the time I made a comparison between several solutions, including the two proposals by @mgibsonbr in his reply.

Next, my results -- if my memory does not fail!

Trade offs...

To @mgibsonbr’s "1." solution has the disadvantage of possibly occupying too much disk space (since you will have the "characters" of the tags repeated numerous times in the whole table). This is a disadvantage as it forces your database to have to read many "pages" from your hard drive, so you have to turn the disk around a lot and move the reading head a lot, which can have great latency. The advantage is that you make only 1 select to get your result.

Already the @mgibsonbr’s "2." solution uses less disk space (because in the giant table only tags ids -- will be saved and if you use the numeric type of size appropriate to the maximum amount of tags, you can reduce to 4, 2 or even 1 byte per line). Thus, you can read more lines per "page" read from disk reducing latency. In contrast, your select would probably have a Join:

select from tags_objects, tags
 where tags.id = tags_objects.tagId
   and (tags.name = 'tag-buscado-1'
     or tags.name = 'tag-buscado-2') -- etc...

This Join is to blame for performance problems in this solution.

Most efficient solution (in my case of specific use)

In the end, the most efficient solution I was able to use was the "2." solution with 2 different selects. The first select searches for the tag ids, and the second select uses the tag ids in the giant table. It’s like I did Join "manually".

This was advantageous for me because, in my case, I was able to cache, in my application, the tags ids. This cache was updated by a background thread (doing a "full scan" on the lowercase table that contains tags and their ids every "X" second). At the end of the day, in practical terms, the "synchronous" calculation was only a select in the giant table with the column "tagId" being some numerical type, therefore smaller than having to do joins.

Obviously, for performance reasons, it is necessary to place an index in the "tagId" column of the giant table.

Before implementing all this solution, my queries lasted ~1min or ~2min with, if I’m not mistaken, about 5 tags. After all this, I was able to reduce the time of the queries to something around ~10s!

Considerations

It is quite complicated to analyze beforehand which solution will perform best in this case, because it really depends on the characteristics of your project. I hope that this response will give some guidance in your search for the most efficient solution for your specific case.

  • Excellent, I still expect other answers but both yours and mgibsonbr helped me a lot.

9

I can’t talk about efficiency, but one way to query for a value using a single query would be to use the IN in an unconventional way - with the columns on the right. Example:

select * from minha_tabela
where 'foo' in (tag1, tag2, tag3)
  and 'bar' in (tag1, tag2, tag3);

Source: the book "SQL Antipatterns", in English.

P.S. According to this same book, modeling the data in this way (Multi-column Attributes) is a bad practice, and causes several problems beyond that of indexing (it also complicates insertion, removal, make each tag only appear at most once per line, etc.). The ideal would be to create a dependent table (or a intersection/junction table, if each tag is more than a simple string) to associate the tags to the table in relation N to N. Unless you have your reasons for maintaining this design, I suggest changing it.


Updating: as the question has been edited so as to admit to changing the structure presented, I will complement with examples of dependent table and intersection table:

  1. Dependent table

    PK                Índice  FK, Índice
    ID Etc            Tag     ID_Tabela
    -- ---            ---     ---------
    01                abc     01
    02                xyz     01
    03                bla     01
                      foo     02
                      bar     02
                      ...
    

    In this case the tags have been moved to a separate table, where each tag is associated with only one row in the original table. The "tag" column is indexed, so the query Tag -> IDs is fast. And the column "id_table" - foreign key to the original table - is also indexed, so that the query ID -> Tags also be quick.

    (Note: the "tag" column is not UNIQUE, because each tag can appear more than once in the table.)

  2. Table of intersection/junction

    PK                FK        FK              PK
    ID Etc            ID_Tabela ID_Tag          ID Tag
    -- ---            --------- ------          -- ---
    01                01        01              01 abc
    02                01        02              02 xyz
    03                03        02              03 bla
                      01        03              04 foo
                      03        03              05 bar
                      ...                       ...
    

    In this case (useful if you want to store more information about a tag, or maybe in a single query change the name of the tag without touching the associations) the tags have been moved to a separate table, and an intersection table relates the two (in a relation from N to N). Only add indexes where necessary, depending on the specific queries you intend to make.

These two techniques have also been adapted from the book, and in general are good practices that also maintains its standard model. For performance issues - which I have no say in, because I have never operated such a system at scale - where denormalization or other optimization techniques are admitted (caching, offload in the application layer, etc) see for example the response of @Bruno Reis.

  • 1

    It summarized well what I was going to say. I just add that a composite index in the 3 columns would be used in a query like this. Toast: sqlfiddle I had ready: http://sqlfiddle.com/#! 2/13f8a

  • @bfavaretto If you have something to add regarding efficiency, go ahead, because even with the index I wouldn’t know if my proposed SQL is efficient or not.

  • 1

    @bfavaretto, by chance your comment refers to the original answer, before the first edition (i.e., select ... where ... in)? If yes, a single composite index in the 3 columns could not actually be used. The reason is simple. Suppose your index contains the columns in the order (tag1, tag2, tag3). Now suppose a line with tag1 = xxx, tag2 = foo, tag3 = bar. In this condition, the entire table must be traversed (i.e., full scan) to locate this line, since none of the select fixed tag1 = xxx.

  • @Brunoreis, that’s right! I understand very little of indexes, so I did not want to make an assertion, but already suspected that my code posted would need a full scan - with or without indexes...

  • @mgibsonbr, regarding performance, it is difficult to say beforehand which of the solutions (i.e., 1-N or N-N, which you called "1." and "2." in the edition) is more efficient. About 1 year ago I developed a system very similar to this in which I had a few million "objects" and a dozen tags (so ~20M to ~30M of entries in a relationship table). The "1." solution spends more disk space, so you’ll probably have more reads, which can slow down. On the other hand, "2." needs Join. [continues to follow]

  • @mgibsonbr, [continuation] The most efficient solution I found in the end was to use "2." but with 2 distinct selects: in the first select, you get the ids of the tags you search for. In the second select, you filter by "Where tagId = XXX or tagId = YYY" etc to XXX, YYY, etc being each of the tag ids you found in the first select! With this, I got reductions of the order of 90% in the query execution time.

  • @mgibsonbr, finally, if the tags are not "believable" (i.e., different from how the OS works for example where you can create tags at any time), the most efficient solution, for sure, is the "1." with the tags in an "ENUM" field-- since, internally, the lines will only contain references to ENUM values instead of containing the "text" of the tags; this avoids excessive reads because it reduces the size of the table and also avoids the need to make 2 selects!

  • 2

    @Dude, put that as an answer! The practical experience of someone who has been through a situation is worth much more than the opinion of someone who only knows the theory...

  • 1

    @mgibsonbr, posted :-). If you want a reading tip on indexes, Mysql documentation is excellent. In Mysql 5.6, see the following pages of the manual: 8.3.1, 8.3.4, 8.3.5 and 8.8.3. It may also be useful to study B-Trees in some data structure text to understand Mysql indexes more deeply (for example in Sedgewick’s Algorithms).

  • @Brunoreis Yes, I was referring to the original answer. I understand the question of the index, since the condition does not have an AND of the values of the 3 columns. But EXPLAIN said "using index," so I’m confused.

  • @bfavaretto, humm, curious, at night I will investigate further in depth!

  • 1

    @bfavaretto, as promised, I tested further. I managed to create a situation where EXPLAIN actually says "using index", but in fact, this does not mean that the query is being accelerated. Insert tens of thousands of lines so that only a small fraction of them match the query. Then run explain. See the "Rows" column, it will show the total number of rows in the table -- the whole table must be traversed! In this case, it is practically indifferent for Mysql to use the index or not, as it will do full scan of the table or index. [continue]

  • 1

    @bfavaretto [continuation]. Now, I bet with you that in the query you ran in explain, you only asked in the result set the tags and, at most, the key Primary. That’s precisely why Mysql can use the index in this case: the index contains all the columns you ask for in the result set (the 3 tags plus the Primary key). Now, try asking for another column, I don’t know, a "name", in your query. If you don’t have that column, add it. You will see that now, through EXPLAIN, Mysql does not even think about using the index. Therefore, in fact, the index does not accelerate in this query by tags! It was clear the pq?

  • 1

    Sqlfiddle with example: http://sqlfiddle.com/#! 2/e5a26/2 More info: http://dev.mysql.com/doc/refman/5.1/en/explain-output.html#jointype_index

  • I went back in the my sqlfiddle, @Brunoreis, and you’re absolutely right. The using index some if you add another column in SELECT. And in fact it does full scan. It was clear yes, thanks for the explanation!

  • 1

    Cool! Just one last detail: there is a reason why, in extreme cases, it can actually be a little faster to run the query by the index instead of running it by the normal table. The case is if you have too many columns with too much data. Full scan in the table will cause too many reads on the disk, which is slow. On the other hand, full scan on the index will cause less reads, as the index does not contain the "extra" columns. If in your result set you do not ask for data from other columns (except the Primary key), then select by index is faster because of the less time spent reading disk!

Show 11 more comments

7

Query explicada

In Postgresql if you create an Indice with multiple columns as the example of the figure and run the query also as illustrated it will be able to use the Indice for the query and will handle alone to solve the query efficiently when executing a bit map in the results partial calculated.
Just know the Mysql planning engine will treat the execution of the query in the same way.

  • This is exactly how I do in Postgresql :)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.