How to model a set of 1 to 4 predetermined values

Asked

Viewed 102 times

4

I am modeling a Mysql database where the main entities are documents. Among several other fields, these documents relate to one or more phases of a particular project.

There are 4 basic values (let’s call them A, B, C and D), and the property value phase is a set of 1 to 4 of them. For example, a document may be related only to phase A, phase A and B, phase B, C and D, etc.

I see several ways to model this, but I’m in doubt about which one would be more appropriate. The options I’ve considered:

  • 4 columns "boolean" (TINYINT(1)), one for each phase.
  • A single binary column with 4 bits (BIT(4)).
  • A column of the type set: SET('A','B','C','D').
  • Relationship table.

I will need to get/display both the list of phases of a given document and the list of documents linked to a given phase.

What would be the most appropriate structure in this case? What criteria should I take into account to make this decision?

1 answer

4


Unless you’re experiencing performance issues, I would go with the relationship table, otherwise it would complicate your darlings. However, if the number of documents is too large in relation to the phases, there is a less costly alternative that still preserves the normalization of the data:

Since the number of possible sets is small (16 at most), you can create a separate table to represent a set. This table of sets would basically have an ID - maybe encoded in this binary pattern, but not necessarily - and nothing else. Then you would create a relationship table between each set and its phases (1 + 4 + 6 + 4 + 1 = 16 lines).

Already the document, this would have a field that would function as a foreign key for the table of sets. A 4-bit integer is enough. Thus each document would have a data occupying as little space as possible, the overhead would be all in the table of sets (+ a Join additional, of course).

create table fase(
  id    serial primary key,
  nome  text
);

create table conjunto(
  id   tinyint(4) primary key
);

create table fase_conjunto(
  id_fase integer,
  id_conj tinyint(4)
);

create table documento(
  id    serial primary key,
  nome  text,
  fases tinyint(4)
);

To obtain all phases of a given document:

select f.*
from documento d
  join conjunto c on d.fases = c.id
  join fase_conjunto fc on fc.id_conj = c.id
  join fase f on fc.id_fase = f.id
where d.nome = "d2f12";

And all documents of a certain phase:

select d.*
from fase f
  join fase_conjunto fc on fc.id_fase = f.id
  join conjunto c on fc.id_conj = c.id
  join documento d on d.fases = c.id
where f.nome = "Fase 1";

Example in Sqlfiddle

  • Thinking here: if you do not make a point of Foreign Keys, the table conjunto is unnecessary. I just don’t know if it’s a good idea, after all there are only 16 lines... Anyway, you can save a Join, comparing fc.id_conj directly with d.fases.

  • Interesting idea, but I don’t really have as many documents to justify the complexity. I really doubt that performance is a problem with any of the alternatives I put in the question. That’s why I was left with no parameter to choose one of them. I liked the pro-relationship table argument, to preserve normalization. The other options seem to me more economical (do not require additional table), but apart from SET (which I think only exists in Mysql), they are ugly and not very self-explanatory.

  • Yeah, my main parameter here wasn’t even efficiency, but simplifying darlings. My experience with comics in general is limited, and I imagine that the other options would complicate the queries a little. But I may be wrong. Let’s see if anyone with more dominance of that area responds. P.S. Migrating from a binary column (option 2) to a scheme like this would not be difficult, and vice versa, so that there is a certain flexibility if it is discovered in the future not to have made the right decision.

  • I think deep down I want to know which option is more "elegant". Does my question is based on opinion?

  • I don’t know... for me, "elegant" in the context of a relational comic will always mean "normalized", so I would reject the binary column and maybe what it uses set (I have no experience, but a column set would not violate the first normal form? ). As for the others, my preference for the relationship table is because it "homogenizes" the darlings - while the boolean columns would require a different query for phase A, another for phase B, etc (i.e. each complex query involving a phase would have to be adjusted for each boolean column, you cannot use a single parameterized query).

  • I would need to study further, but from my quick research there are controversies as to whether or not sets violate 1NF. Anyway, I think you’re right, in terms of elegance the relationship table is standard, and it’s always a safe option.

Show 1 more comment

Browser other questions tagged

You are not signed in. Login or sign up in order to post.