How to create multiple entries in an index based on columns in the same row?

Question

How to create multiple entries in an index based on columns in the same row?

Asked 11 years, 6 months ago

Viewed 1,569 times

12

I have never found a good way to index multiple columns of a row as index entries or simulate this feature in Mysql.

The problem arises when you have fields working as tags or a similar concept. E.g. columns with names tag1, tag2, tag3. To search lines with a tag specific quickly requires you to have 3 indexes and make 3 separate queries in the most basic and obvious way.

Is there any way to index these three fields as entries of a single index that allows you to perform only one search.

Trying to illustrate how it would look

ID tag1 tag2 tag3
-- ---- ---- ----
01 abc  xyz  bla
02 foo  bar  ble
03 xyz  bla  bar

Como o índice se comportaria:

abc -> 01
bar -> 02 03
bla -> 01 03
ble -> 02
foo -> 02
xyz -> 01 03

So if I do a search for "bla", I will have access to Ids "01" and "03".

Is there any other way to do this efficiently? Even if it changes the presented structure.

Is the goal to query the X tag and return the Ids with it, or a query that returns the entire "index"? And do you want a solution to that specific table structure, or are you open to review the structure?

– bfavaretto

2014/01/30 at 04:56
@bfavaretto edited to answer your questions

– Maniero

2014/01/30 at 05:01

3 answers

12

Some time ago (2013? 2012?) I developed a system very similar to the question. I had a few million objects and a dozen "tags", and each object could have 0 or more tags associated. I had to filter these objects based on searches for tag sets. Similar, no?

As the number of tags per object, in my case, was theoretically unlimited (since new tags could be added to the system after release), the solution proposed in the question did not suit me, i.e., I could not use a table with columns "tag1", "tag2"etc. Also, this scheme does not allow use of indexes (see my first comment in the @mgibsonbr reply).

As I needed a lot of performance (i.e., queries answered in "seconds"), at the time I made a comparison between several solutions, including the two proposals by @mgibsonbr in his reply.

Next, my results -- if my memory does not fail!

Trade offs...

To @mgibsonbr’s "1." solution has the disadvantage of possibly occupying too much disk space (since you will have the "characters" of the tags repeated numerous times in the whole table). This is a disadvantage as it forces your database to have to read many "pages" from your hard drive, so you have to turn the disk around a lot and move the reading head a lot, which can have great latency. The advantage is that you make only 1 select to get your result.

Already the @mgibsonbr’s "2." solution uses less disk space (because in the giant table only tags ids -- will be saved and if you use the numeric type of size appropriate to the maximum amount of tags, you can reduce to 4, 2 or even 1 byte per line). Thus, you can read more lines per "page" read from disk reducing latency. In contrast, your select would probably have a Join:

select from tags_objects, tags
 where tags.id = tags_objects.tagId
   and (tags.name = 'tag-buscado-1'
     or tags.name = 'tag-buscado-2') -- etc...

This Join is to blame for performance problems in this solution.

Most efficient solution (in my case of specific use)

In the end, the most efficient solution I was able to use was the "2." solution with 2 different selects. The first select searches for the tag ids, and the second select uses the tag ids in the giant table. It’s like I did Join "manually".

This was advantageous for me because, in my case, I was able to cache, in my application, the tags ids. This cache was updated by a background thread (doing a "full scan" on the lowercase table that contains tags and their ids every "X" second). At the end of the day, in practical terms, the "synchronous" calculation was only a select in the giant table with the column "tagId" being some numerical type, therefore smaller than having to do joins.

Obviously, for performance reasons, it is necessary to place an index in the "tagId" column of the giant table.

Before implementing all this solution, my queries lasted ~1min or ~2min with, if I’m not mistaken, about 5 tags. After all this, I was able to reduce the time of the queries to something around ~10s!

Considerations

It is quite complicated to analyze beforehand which solution will perform best in this case, because it really depends on the characteristics of your project. I hope that this response will give some guidance in your search for the most efficient solution for your specific case.

Excellent, I still expect other answers but both yours and mgibsonbr helped me a lot.

– Maniero

2014/01/30 at 11:28

Browser other questions tagged mysql database optimizing índices

You are not signed in. Login or sign up in order to post.

by mgibsonbr • **80,631** points · Answer 1 · 2014-01-30T05:03:05+00:00

I can’t talk about efficiency, but one way to query for a value using a single query would be to use the IN in an unconventional way - with the columns on the right. Example:

select * from minha_tabela
where 'foo' in (tag1, tag2, tag3)
  and 'bar' in (tag1, tag2, tag3);

Source: the book "SQL Antipatterns", in English.

P.S. According to this same book, modeling the data in this way (Multi-column Attributes) is a bad practice, and causes several problems beyond that of indexing (it also complicates insertion, removal, make each tag only appear at most once per line, etc.). The ideal would be to create a dependent table (or a intersection/junction table, if each tag is more than a simple string) to associate the tags to the table in relation N to N. Unless you have your reasons for maintaining this design, I suggest changing it.

Updating: as the question has been edited so as to admit to changing the structure presented, I will complement with examples of dependent table and intersection table:

Dependent table
```
PK                Índice  FK, Índice
ID Etc            Tag     ID_Tabela
-- ---            ---     ---------
01                abc     01
02                xyz     01
03                bla     01
                  foo     02
                  bar     02
                  ...
```
In this case the tags have been moved to a separate table, where each tag is associated with only one row in the original table. The "tag" column is indexed, so the query Tag -> IDs is fast. And the column "id_table" - foreign key to the original table - is also indexed, so that the query ID -> Tags also be quick.

(Note: the "tag" column is not UNIQUE, because each tag can appear more than once in the table.)

Table of intersection/junction

PK                FK        FK              PK
ID Etc            ID_Tabela ID_Tag          ID Tag
-- ---            --------- ------          -- ---
01                01        01              01 abc
02                01        02              02 xyz
03                03        02              03 bla
                  01        03              04 foo
                  03        03              05 bar
                  ...                       ...

In this case (useful if you want to store more information about a tag, or maybe in a single query change the name of the tag without touching the associations) the tags have been moved to a separate table, and an intersection table relates the two (in a relation from N to N). Only add indexes where necessary, depending on the specific queries you intend to make.

These two techniques have also been adapted from the book, and in general are good practices that also maintains its standard model. For performance issues - which I have no say in, because I have never operated such a system at scale - where denormalization or other optimization techniques are admitted (caching, offload in the application layer, etc) see for example the response of @Bruno Reis.

by Raul Almeida • **1,512** points · Answer 2 · 2014-01-30T12:20:41+00:00

Query explicada

In Postgresql if you create an Indice with multiple columns as the example of the figure and run the query also as illustrated it will be able to use the Indice for the query and will handle alone to solve the query efficiently when executing a bit map in the results partial calculated.
Just know the Mysql planning engine will treat the execution of the query in the same way.