Store data in DB vertically or horizontally?

Asked

Viewed 75 times

3

I recently posted a question about number of fields in a table here: Number of Columns X performance in tables

Now I am curious about the standardization in the creation of the table below, I have the following situation:

Table of Plans:

Example 1: (as I sometimes see)

create table planos(
  pl_codkey (primary),
  pl_nome_plano varchar(80),
  pl_parcela_1 decimal(12,2),
  pl_parcela_2 decimal(12,2),
  pl_parcela_3 decimal(12,2),
  pl_parcela_4 decimal(12,2),
  pl_parcela_5 decimal(12,2),
  pl_parcela_6 decimal(12,2),
  etc.
)

Notes example 1: We have to create SQL according to the portion (column) sought, I see two ways 1: fill the system ifs/cases or concatenate the portion number in the column name when creating the SQL string, we have another detail if we need more plots we have to add more fields, imagine a property plan where we have 180 plots or more.

Advantage in example 1: save the data in a single line. Downside: The treatment for searching and displaying data requires more programming and need to tweak the database if you need to add more fields, unless you already create a table with 500 fields :-).

Example 2 (how do I)

create table planos(
  pl_codkey (primary),
  pl_nome_plano varchar(80),
  pl_parcela int,
  pl_taxa decimal(12,2)
)

Notes example 2: I do the search by pl_codkey and plot

In example 2 we have the easiness in the survey and elaboration of the table, because no matter the number of installments the bank is already prepared.

Advantage: Does not require table structure maintenance

Disadvantage: Requires a little attention if you work with Groups of Plans, we will have more records according to the number of plots and plans.

Now thinking:

Imagine inserting 1000 clients with 36 installments plans, we would have 36 thousand rows with 4 columns and in example 1 and we would have 1000 rows with 36 columns.

I wonder if in example 2 I can slow down faster than in example 1 over time

  • i edited my answer on an important detail about having plan id (pl_codkey ) and portion in the same table, this will give problem ;)

3 answers

3


Both can be used, of course. One is the formally correct, the other can give more efficiency in the system, in addition to the differences already noted by the author of the question.

One table, several lines (wrong)

From the point of view of normalization, among those presented, the second model seems to be the most appropriate. In fact it is quite possible that neither it is right, ie I’m almost sure that strictly the second is wrong.

Two related tables (formally correct)

Probably the right one would be a third form keeping this table of plans without the plots and have an auxiliary table only with the plots. I cannot say whether this is the most correct because I do not know the whole real problem and the example shown is artificial to make such a statement, although there are indications that the most correct one would be the auxiliary table. The second form does not meet the correct normalization, and already breaks the first normal form (among the 6 currently known).

Normalization indicates that you cannot have redundant information, nor have information without completion. So if you have lines whose fields have the same information (not by coincidence) it is being redundant. pl_nome_plano is clearly redundant, you can not create multiple lines with this same information being that it should be unique, is thinking to create many lines for a single item, this hurts the normalization and creates difficulties. So the second model would be not only inappropriate, but formally wrong. Only if adapting it to the auxiliary table would it be correct, and depending on the situation would be appropriate.

Single table, multiple fields (acceptable)

The first case can be perfectly acceptable as an exception to the rule, since although it is formally wrong, it does not create serious problems, and can compensate for its use if it needs efficiency (not always). Right must be a priority.

If everything is in one table, in one row you can take the data with less effort in the database and this is useful in some scenarios. The great disadvantage of this is that it needs to have a field for each possible plot and in most cases many of these plots will be null, which will take up some space (some Dbs have a very low cost, depends on the implementation of each one, but for a large volume of fields there will be space waste).

Depending on the database system adopted this could be put in a array of fields, or could simulate this array in a text field or blob. It is not so necessary, there are advantages and disadvantages in doing this.

This model has more Nosql face (which is actually the model that ignores relationships).

It’s easier to access the data directly, but you need to manipulate it in some situations, as the question already notes. But the relational model is not carefree, as it is also observed, and indeed can be even more laborious depending on the scenario.

Completion

The amount of lines themselves is no problem.

So to be clear, the fastest way tends to be the first model. The second one tends to be the second best case and the third one I exposed using auxiliary is only the third fastest, under typical conditions, the use of join It’s usually the worst performance, so the myth has been created that Nosql is faster, comparing different things. Only the third way (which I defined) is formally correct.

The second example of the question has an error in the primary key, failed to include the portion number in it to give a single key. It would be something like this:

create table planos(
    codkey (primary),
    nome_plano varchar(80)
)

create table planos_parcelas(
    codkey (primary),
    item int (primary),
    valor decimal(12, 2)
)

I put in the Github for future reference.

Note that you do not need foreign key, this is waste.

If you don’t want to violate any normal way it would have to be. Not always this should be priority, in many cases I would adopt the first model of the question.

I still don’t know if the model as a whole is totally correct. It is still confusing to have values of plots within the plans, for me the whole concept is wrong, but it may just be lack of enough information, or wrong nomenclature. Plans should not have values, but if there is a lack of explanation (not that it is relevant to the question problem, but may hide a more serious error in modeling), or it may be that this is not a plan, but the use of the plan, which is an account or something like that. But there may be a justification for it, I’m just speculating.

  • In fact in my base I have the table "auxiliary" (which is actually the mother table) that would be the "group_planes" and in the table of plans I have the group_codkey instead of pl_name_plan, I posted this way precisely to raise the question and suggestions like this yours.

  • I do not know well, but maybe this group of plans is gambiarra, IE, not group, anyway, by what is in the question is either group or is taking the worst of both worlds, maintaining auxiliary table and redundancy.

  • In the group of plans I have the name of the group and the number of maximum plots of that plan, so in the flat table only enters the amount of records according to the plan... I did not post as I did because the question refers to store the data horizontally or vertically, but your answer and the other members show that the horizontal error would be to have blank fields for no reason when the amount of plots does not reach the field total.

  • It seems to me that this is really a group, so it’s not relevant to the problem, but don’t confuse that with the auxiliary table. Not blank field, null https://preview.redd.it/0k7yrif7bkg11.jpg?width=960&crop=smart&auto=webp&s=3e51a116dca547d8207b1e88859d04589404deca, if you have a maximum of 360 fields, you can have 45 bytes of overhead on all table rows. , so a form of array or string/blob could be better if you adopt this form for efficiency. If there are few fields and few fields remain blank, then it can be something acceptable, everything depends on the exact problem.

  • Remembering that performance should not be a priority, but it should not be neglected.

3

By normalization using the normals forms (1F to 5F) its first model would not be "correctly normalized".

The third normal form (3F), which is the form that moves the fields that are dependent on each other to another entity, says that:

3FN defines that all attributes of this table must be functionally independent of each other, while they must be dependent exclusively on the primary key of the table

Source: https://pt.wikipedia.org/wiki/Normalizationof data

That is, can you have parcel 4 without having the first 3? or have only the 6 only having the 1 and 3? I believe not, are clear cases of dependence on one another of the columns, so it would not be correctly normalized.

The second model does satisfy the normalization.

About the statements about the second modeling:

Upside: Does not require maintenance of table structure

Downside: Requires a little attention if you work with Groups of Plans, we will have more records according to the number of plots and plans.

About the disadvantage this is a fact: the more normalized your model, more entities will have more segregated, that is, but independent and separate, which will require more work when doing the query (the more tables more JOINS :) )

But there is a counter-argument in relation to "we will have more records according to the number of plots", and will also have fewer unnecessary records if you have fewer plots :)

More important points:

  • The table of "plans" is to have only plan data? Because if it is violates even more the normalization. It should have a table "plans" and another "plots_plans" to satisfy the 3F (normal third form).
  • In your model only having "portion" will give you a problem with the primary key!

Let’s take a practical example:

create table planos(
  pl_codkey (primary),
  pl_nome_plano varchar(80),
  pl_parcela int,
  pl_taxa decimal(12,2)
)

insert into planos(pl_codkey,pl_nome_plano,pl_parcela,pl_taxa )
           values (1, 'Plano 1', 1, 10);

How to insert another look for the same plan? this command below will violate the primary key by inserting the second installment for the same plan:

insert into planos(pl_codkey,pl_nome_plano,pl_parcela,pl_taxa )
           values (1, 'Plano 1', 2, 12);

That is, you need to separate the plots from the plans!!!

create table planos(
  pl_codkey (primary),
  pl_nome_plano varchar(80)
)

create table planos_parecelas(
    ppp_codkey (primary),
    pl_codkey (foreign key de planos)
    pl_parcela int,
    pl_taxa decimal(12,2)
)

And then the Insert:

insert into planos (pl_codkey,pl_nome_plano)
            values (1, 'Plano 1');

-- aqui as parcelas do 'Plano 1'

insert into planos_parecelas(ppp_codkey,pl_codkey,pl_parcela,pl_taxa)
    values (1, 1, 1, 10);

insert into planos_parecelas(ppp_codkey,pl_codkey,pl_parcela,pl_taxa)
    values (2, 1, 2, 12);
.. etc...

And finally your question:

I wonder if in example 2 I can be slower than that in example 1 over time

No, because in fact you would be "optimizing" your database to work with separate information!

Imagine that you only want to know about plans, if separating plot plans can make select in a table with much less records and faster. The table that would have appeared would have many records but it has few columns and would be easily filtered by the keys :)

0

The first model is unsustainable. Imagine that you design the database to support up to 180 installments. That is, a table with + 180 columns.

  1. Any installments that have less than 180 installments will entail some waste of disk space. Imagine that Zézinho bought a product in 10x. It will occupy the first 10 columns - the other 170 will be blank. Scaling this up to multiple records, you spend a lot of storage, very fast.

  2. If eventually you want to support larger installments (with more installments), you will need to change the database.

With this, the second model is more suitable and flexible. If there is any concern about the performance of consulting a table with thousands of rows, you can normalize it as follows: in a table, you store the basic information of the installment. That is: name of the debtor, number of installments, etc. In another, you will store the actual installments, with foreign key pointing to the installment table.

Thus, you have an easier database to handle, without sacrificing performance. It will be possible to quickly obtain information on the plots, and if it is necessary to know specifically about the plots, just run a single JOIN.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.