How to model a tree data structure using a relational database?

Asked

Viewed 6,498 times

42

How to appropriately and efficiently define naturally organized tree-based data in relational databases, considering the physical implications of this? That is, organize in a way that minimizes the accesses made in any node.

Assume that RDBMS does not have special features to handle this except the DDL of ANSI SQL or features available in all key relational database systems. Possible optional specific optimization may be useful, but not mandatory.

If there is a way (query) better suited to access (scan/traverse) structured data in this way, an example would help.

3 answers

35


Well, it is always preferable to use a DBMS that handles data on tree in a natural way, even better when it is specially designed for such purpose, as for example, the Neo4j.

But there are some ways to work on relational DBMS, I will list 4 of them:

1. Adjacent List

Most commonly used solution, each input (record) knows only its top node, see an example of commonly used structure:

Example: Comments on a Blog

In this example, there is a table that stores comments from a blog, there are the main comments, which has no top nodes (they are the "root"), and there are the comments that are responses to another comment, and so on, in a DBMS table, we would have the following situation:

Table: comentarios

id          parent_id        usuario_id        comentario
----------------------------------------------------------------
1           NULL             33                Gostei do seu...
2           1                34                seu comentario foi exc...
3           1                99                Gostei Também...
4           3                34                Que bom que gostou...
5           NULL             80                Que Post Bacana ...

Rendering, we would have something like:

Usuario 33 Comentou: Gostei do seu ...
      Usuário 34 Comentou: Seu comentario foi exc...
      Usuario 99 Comentou: Gostei Também...
            Usuário 34 Comentou: Que bom que gostou...
Usuario 80 Comentou: Que Post Bacana ...

Comments have the same data structure, however they are nested in this way, it is necessary to recursively traverse all data from a certain ID.

POSITIVE POINTS: Easy to implement
NEGATIVES: Difficult to handle in arvores deep, works well when there are few levels

2. Path Enumerated

Basically the same structure as the example above, but the table contains the path from the root node to itself, see the example:

Table comentarios

id          path_to_comment  usuario_id        comentario
----------------------------------------------------------------
1           /                33                Gostei do seu...
2           /1/2             34                seu comentario foi exc...
3           /1/3             99                Gostei Também...
4           /1/3/4           34                Que bom que gostou...
5           /                80                Que Post Bacana ...

this way we could take the comments below the comment of ID 3 doing something like

SELECT * from comentarios WHERE path_to_comment LIKE '/1/3/%';

POSITIVE POINTS: Easy to implement, faster and more efficient consultation than the Lista Adjacente
NEGATIVES: It is relatively difficult to retrace the path when relocating an element

3. Nested sets

It’s a slightly more complex method, I’m going to give you an overview, because a full answer would greatly prolong a response that is an overview of the methods, so let’s go:

They consist of storing with each node, two numbers (one to the left and one to the right): the one to the left stores a smaller number than the smaller ID of their descendants, and the number to the right stores a larger number than the larger ID of their descendants; then, when making a query, we would have a search scope that in terms of performance would be much more efficient than the other methods, see an image that illustrates this:

Arvore de Conjuntos Aninhados

POSITIVE POINTS: Much higher performance, ease of atravessar.
NEGATIVES: Much harder to implement than other methods; performance is not so much higher in SGDB that it does not support recursive queries (example: Mysql)

I recommend you ask a question here at Stackoverflow about this technique, so I or another user can demonstrate you improve such technique

4. Table of Relationship

Basically a "Many-to-many" table that stores all the links of a node with its descendants, similar to the technique of the enumerated path, but a little more flexible and with more performance than such.

P.S. I will improve this answer as soon as possible by citing more examples, but this is a giant subject, and each of the techniques would yield at least one question here in the OS

  • 3

    The nested sets I like very much to navigate all items with "next" and "previous", just sort by the value L to have all items in "read order".

  • 3

    The "Table of Relationship" (closure table) is particularly interesting because it allows you to go hereafter of a tree-like structure - allowing to represent any Directed acyclic graph (DAG).

  • 1

    @mgibsonbr I will improve this answer to include more examples, even being an extensive response, this is a really interesting theme

  • 3

    Sorry for the lack of modesty :) but I’m looking for interesting questions. I’m glad there are people giving interesting and surprisingly good answers. I’ll just disagree with the first sentence. We have numerous reasons to use the RDBMS for this. Examples: Most systems do not need this and having 2 systems increases complexity; trying an unknown but more suitable non-relational DB would probably result in something worse; another DB is not available (Binding for language or possibility of use in deployment); does not perform well other necessary tasks.

  • 1

    @bigown is not necessarily disagreeing with the answer, I share his thought, in a system it is strangely costly to maintain more than one bank, but what he said was that to treat data in tree, it is best to use something specific, but it is not always advisable

  • 2

    I ended up posting a separate answer. The techniques described are the same (after all your answer is quite complete, at first I was not going to put mine), but I organized in a different way. I hope it’s of some use...

  • @hernandev , waiting for you to finish your answer

Show 2 more comments

20

To determine the best structure to represent your tree, ask the following questions:

  1. The maximum depth is guaranteed very small? (i.e. a maximum of 3)

    If the answer is yes, a simpler solution may be ideal (the more complex soups "pay for themselves" when the stopover problem is big). The "Adjacencies List" (Adjacency List) - where each node has a foreign key for its parent - it is a simple and straightforward solution. However, there are authors who consider it a "naive solution" (naïve) - applicable yes in simple cases, but generally to be avoided.

    Note that you may need to write more than one query depending on the depth of the node (e.g.: one for depth 1, one for 2, one for 3), so the tree requirement is guaranteed to be shallow. Some Dbms support "recursive queries" (ex.: WITH, START WITH and CONNECT BY) that can simplify the task, but are not portable, which is beyond the scope of the question.

  2. The maximum expected depth is little?

    Another simple technique, but without the adjacency list problems is the "Enumeration of Paths" (Path Enumeration), where each node has a texual column where the list of idFrom the root to that knot. It is simple to either insert/remove/modify nodes or traverse them (either the sub-Ávore or its direct children), all you need is an operation in the textual field - usually involving prefixes.

    It is good to note that this table would be denormalized (is not in the normal first form) and does not have referential integrity (column would accept lists with ids non-existent), beyond which the paths can become very long in deep trees. (the space requirement of a node is proportional to its depth)

  3. The maximum expected depth is very large? There are knots that are ancestors of most other knots?

    If the answer is nay, consider using a Closure Table (which @Ernandes called "Relationship Table", but could also be called "Relationship Table Closing") - a separate table where each row represents a relation ancestral -> descendente (and maybe profundidade). This table provides quick access in exchange for larger storage space, and is efficient for virtually every type of operation.

    One drawback (if using pure SQL) is the need to keep two tables up to date, and relatively complex (but efficient) queries. And if the depth is too great, leaf nodes will have a high number of ancestors, each occupying a line at the lock. (A node’s space requirement is also proportional to the depth, but the overhead is larger - a line for each ancestor) An alternative technique can be interesting unless...

  4. Does your tree change frequently? Does your table represent more than one tree? (and it is feasible that two trees will be "merged" at some point?)

    If the answer is nay, The use of "Nested Sets" can offer a good performance with low space consumption. In it each node defines an interval (two numbers: lower/left limit and upper/right limit) and is considered "down" any node whose interval is contained (nested) in the interval of another node. In general, query queries are simple and efficient, and the space requirement is small and fixed for each node (two integers).

    However, although the removal of a knot is inexpensive, the insertion or locomotion of a knot can be very expensive, as it is necessary to update all their ancestors, their brothers (from right traditionally, but a variation would be to choose the side with fewer siblings) and the brothers of all their ancestors (same side). If the knot is not leaf, also their descendants. If time is more important than space, it is preferable to adopt Closure Table in that situation.

    Note: if your table represents more than one tree, there is an additional complication: unless each node identifies which tree it belongs to, the sets of one tree could interfere with the sets of the other. This would cause changes to a node to affect even nodes of other trees! And if the merging of trees is possible, the whole set of one of them would at least have to be redone.


Complete examples would be too extensive for this answer, but I will include a comparative table of the techniques mentioned (source: book "SQL Antipatterns", in English), and only the example of code requested in the question (i.e. query to cross a subtree).

Design          Tabelas  Consultar Filho  C. Sub-Árvore  Inserção  Remoção  I. Referencial

Lista de Adj.      1          Fácil          Difícil      Fácil     Fácil      Sim
+Query Recurs.     1          Fácil           Fácil       Fácil     Fácil      Sim
Enum. Caminhos     1          Fácil           Fácil       Fácil     Fácil      Não
Conj. Aninhados    1         Difícil          Fácil      Difícil   Difícil     Não
Closure Table      2          Fácil           Fácil       Fácil     Fácil      Sim
  • Adjacency List (node subtree 4, up to 3 levels):

    select * from comentarios where id = 4
    union
    select * from comentarios where parent = 4
    union
    select c2.*
        from comentarios as c1 join comentarios as c2 on c2.parent = c1.id
        where c1.parent = 4;
    
  • Path enumeration (node subtree 1/4/):

    select * from comentarios as c
    where c.caminho like '1/4/' || '%';
    

    (Remembering: look for us whose path has the prefix 1/4/)

  • Nested Sets (subtree of the node 4):

    select c2.*
    from comentarios as c1
    join comentarios as c2 on c2.esquerda between c1.esquerda and c1.direita
    where c1.id = 4;
    

    (Remembering: look for us whose interval - esquerda, and automatically direita - is fully contained within the node interval)

  • Closure Table (subtree of the node 4):

    select c.*
    from comentarios as c join closure as ad on c.id = ad.descendente
    where ad.ancestral = 4;
    

6

I would like to comment on the @Ernandes response, but it was a little long, so I put here as a response.

I’m using Adjacent List on some systems. A drawback is the work in doing things like a breadcumb. It also has a difficulty in determining when a node is disabled and preventing its dependencies from being displayed or even for the enabled ones checking for related and active dependencies (products from a virtual store, for example).

I do not advise any of these techniques (adjacency, nested, etc.) when the system is restricted to a small depth of up to 5 nodes, for example. If we look closely, a virtual store will hardly have subcategories more than 5 levels deep. Finally, before choosing, it is necessary to understand which business model.

Online Store? CRM? File System?

Browser other questions tagged

You are not signed in. Login or sign up in order to post.