When is it worthwhile to denormalize the database?

Viewed 8,960 times

34

Is there any situation where having the database denormalized is more advantageous than having it normalized?

For example

Normalized

  Pessoas                  Cidades
| nome    | cidade_id |  | id | nome        | 
-----------------------  --------------------
| Fulano  |         1 |  |  1 | Nova Iorque |
| Ciclano |         1 |

Denormalized

  Pessoas
| nome    | cidade      |
-------------------------
| Fulano  | Nova Iorque |
| Ciclano | Nova Iorque |

Is there some point at which the second case is more appropriate than the first?
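In SQL terms, a rough sketch of the two layouts (assuming a PostgreSQL-like dialect; the names come from the tables above) would be:

    -- Normalized: the city lives in its own table and is referenced by id
    CREATE TABLE cidades (
        id   serial PRIMARY KEY,
        nome varchar(100) NOT NULL
    );

    CREATE TABLE pessoas (
        nome      varchar(100) NOT NULL,
        cidade_id integer NOT NULL REFERENCES cidades (id)
    );

    -- Denormalized: the city name is repeated on every row
    CREATE TABLE pessoas_desnormalizada (
        nome   varchar(100) NOT NULL,
        cidade varchar(100) NOT NULL
    );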

  • 1

    It would be nice if you explained in the question how you define a normalized database versus a denormalized one. It would help future visitors learn, and whoever answers would have less room to drift off topic. As it stands, it is perhaps a little broad and open to interpretation.

  • I followed your suggestion, @Bacco, and added an example. =)

  • I find the second approach interesting when the data repeats rarely or not at all, e.g. home phone or address (there is a chance of the same data repeating for the same person, but too small to justify a separate table), and sometimes for scalability reasons.

  • A scenario that would make the harm of the denormalized database even more evident is localization. In the second model you would have to rely on a secondary resource, like Gettext, whereas in the first it would be enough, for example, to add an extra column serving as a secondary ID, perhaps relating to a third table of languages.

  • The curious thing is that the second example does not demonstrate denormalization: it does not violate any of the normal forms. Namely: a person does not have and does not belong to N cities, so 1NF is not violated; the table has no composite natural primary key, so it is impossible to violate 2NF; city is in fact an attribute that describes the primary key (it is an attribute of the person and not of another field), so 3NF is not violated either; 4NF and 5NF (more complex) are not violated either. The second example is just repeated text taking up space that could be saved.

3 answers

27


First, let's understand that denormalizing is not the same thing as not normalizing. Denormalization is a conscious process with a clear goal. A well-normalized database is usually faster than a non-normalized one. Each normal form is more extreme than the previous one. Would reducing from one normal form to a lower one be denormalizing? Or would it just be reducing normalization? I don't know.

One of the biggest advantages of normalization is to facilitate data integrity by eliminating redundancy. Denormalization needs to be done very carefully.

Normalization applies well to the relational model. I will not go into cases where that model itself does not apply well; in those cases it would not really be a matter of denormalizing.

I see only one really compelling reason to denormalize a database.

Performance

Once it has been measured that the physical design is hindering the performance, and consequently the scalability, of the database, it can be worthwhile to denormalize it judiciously. That does not mean the logical design should be denormalized.
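A minimal sketch of that idea, assuming PostgreSQL and the tables from the question: the logical design stays normalized, and only a physical, read-only structure is denormalized.

    -- The normalized tables remain the source of truth;
    -- a materialized view stores the pre-joined (denormalized) result.
    CREATE MATERIALIZED VIEW pessoas_com_cidade AS
    SELECT p.nome, c.nome AS cidade
    FROM pessoas p
    JOIN cidades c ON c.id = p.cidade_id;

    -- Queries hit the flattened copy without paying for the join;
    -- the copy is refreshed at whatever interval the use case tolerates.
    REFRESH MATERIALIZED VIEW pessoas_com_cidade;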

OLAP vs. OLTP

The most common approach is to do it on specific data, parts of the database, or even separate databases dedicated to OLAP: operations that are essentially querying, analysis, simulation, and projection, for BI applications, data warehousing, etc.

Simplified local databases, running on low-performance, low-capacity equipment to carry out specific satellite tasks (a POS terminal, for example), may also benefit from the performance gain.

Note that in both cases the denormalization ends up happening more in an auxiliary database. And when it occurs in the main database, it is done in parts that are inherently used for queries. In typical OLTP activities, denormalization is usually not such an attractive option.
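As an illustration only (the reporting table and its load step are hypothetical), copying data into an auxiliary structure used purely for queries could look like this:

    -- Flattened summary kept in a reporting schema or database
    CREATE TABLE relatorio_pessoas_por_cidade (
        cidade      varchar(100) NOT NULL,
        qtd_pessoas integer      NOT NULL
    );

    -- Rebuilt periodically by an ETL/refresh job, not by OLTP transactions
    TRUNCATE relatorio_pessoas_por_cidade;
    INSERT INTO relatorio_pessoas_por_cidade (cidade, qtd_pessoas)
    SELECT c.nome, count(*)
    FROM pessoas p
    JOIN cidades c ON c.id = p.cidade_id
    GROUP BY c.nome;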

But is copying data for analysis really denormalization? Does it affect data integrity?

It can't.

Does it create redundant data?

I have my doubts.

I found no reliable canonical references to settle this. But something tells me that normalization was never intended to cover OLAP activities; when it was created, only OLTP was in mind. My experience (yours may vary) is that denormalization in OLTP usually hinders performance more than it helps.

Performance improvement techniques

In fact, there are not that many cases that actually benefit from denormalization. It is that thing of fashion, half-understood ideas, or applying something to the wrong context. More often than not there are other techniques to lessen the impact on the database before having to resort to denormalization. There are cases where denormalizing solves one problem and causes another.

There are formal denormalization techniques; an example can be found here on SO.
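One example of such an alternative technique, sketched here assuming PostgreSQL (the index names are made up): a covering index can remove the cost that tempted the denormalization in the first place, without touching the schema.

    -- Instead of copying cidade into pessoas, index the join/filter columns.
    -- PostgreSQL 11+ allows INCLUDE so the index covers the query.
    CREATE INDEX idx_pessoas_cidade_id ON pessoas (cidade_id) INCLUDE (nome);
    CREATE INDEX idx_cidades_nome ON cidades (nome);

    -- The lookup by city name now resolves mostly from the indexes,
    -- with no denormalized copy to keep consistent.
    SELECT p.nome
    FROM pessoas p
    JOIN cidades c ON c.id = p.cidade_id
    WHERE c.nome = 'Nova Iorque';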

Simplifying

Another reason, not a very good one but one that can be valid and acceptable, is simplifying the database.

There is a difference between building the system for Manuel's bakery and a large system for a large company. Allowing a list of phones inside a customer record may not be a problem in the former. But managing all the contacts of a complex company does not leave much room to record the data any old way. Imagine a partner with many contacts that changes its phone number. You want to change that number once, probably preserving the internal extensions of all contacts. Will you change them one by one? And what if you need to search your contacts by phone number? It is better to keep this data in a separate table.
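A sketch of the separate-table version, with hypothetical names: the partner's main number lives in one row, so changing it is a single UPDATE and the internal extensions of the contacts are untouched.

    CREATE TABLE parceiros (
        id                 serial PRIMARY KEY,
        nome               varchar(100) NOT NULL,
        telefone_principal varchar(20)  NOT NULL   -- main switchboard number
    );

    CREATE TABLE contatos (
        id          serial PRIMARY KEY,
        parceiro_id integer NOT NULL REFERENCES parceiros (id),
        nome        varchar(100) NOT NULL,
        ramal       varchar(10)              -- internal extension
    );

    -- The partner changed its switchboard number: one statement, extensions intact.
    UPDATE parceiros SET telefone_principal = '+55 11 4004-0000' WHERE id = 7;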

Some simplifications can bring more problems than solutions. Ensuring the integrity of denormalized data can be complicated and risky. It’s not real simplification.

Expediency

Some say denormalizing makes reporting easier. But I think this falls under the OLAP issue. Do you want the data in an easier shape for the report generator? Fine: generate an appropriate set of tables with the necessary data, possibly temporary ones, and make them available to the report. If you only want convenience for data access, create views or use another technique that makes things easier for you. Denormalization should not be used merely for convenience.
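A sketch of that "convenience without denormalizing" option, using the question's tables: a plain view gives the report the flat shape it wants while the stored data stays normalized.

    CREATE VIEW vw_pessoas_cidades AS
    SELECT p.nome, c.nome AS cidade
    FROM pessoas p
    JOIN cidades c ON c.id = p.cidade_id;

    -- The report generator queries the view as if it were the flat table
    SELECT nome, cidade FROM vw_pessoas_cidades ORDER BY cidade, nome;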

Taking your example

Imagine that New York decides to start calling itself Old York. Can you fix that? I don't know. How many different cities called New York exist in the register? "New York" is not the identity of the information, it is a state of it. Let's say there is a way to fix it. Wasn't the point to simplify? Or was it just XGH?

If you really need to simplify, violating the first normal form will cause fewer difficulties. You can use PostgreSQL's array type or any technique that simulates this feature in other database systems. It is only complicated if the database uses fixed row sizes; there can be a lot of waste. But I wonder whether it is real simplification. It takes work to deal with the exceptions.
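A sketch of that kind of controlled 1NF violation in PostgreSQL (table and column names are made up):

    -- Phones stored inline as an array instead of in a child table
    CREATE TABLE clientes (
        id        serial PRIMARY KEY,
        nome      varchar(100) NOT NULL,
        telefones text[] NOT NULL DEFAULT '{}'
    );

    INSERT INTO clientes (nome, telefones)
    VALUES ('Fulano', ARRAY['+55 11 91234-5678', '+55 11 3333-4444']);

    -- Searching inside the array still works, e.g. with ANY
    SELECT nome FROM clientes WHERE '+55 11 3333-4444' = ANY (telefones);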

Above all, if you really know what you are doing, everything is valid. The problem is that if you don't know, you can't judge.

What seems like denormalization but is not

There are cases where you might think you are denormalizing but are only adapting the design to the domain.

A typical example.

An invoice (nota fiscal) usually has a reference to the business partner, and the invoice items usually have references to the linked products. It is normalized, and I doubt anyone thinks it would be good to denormalize the separation between invoice header and items.

If a partner's address (a customer's, for example) changes after the invoice is issued, what happens to the address on the invoice? And what if the price of a product sold on this invoice changes?

You have a beautiful inconsistency and are doing something illegal (at least in Brazil). Bringing customer or product data into the invoice is not denormalization; it ensures the correct archiving of the information. The data in the partner's general register has one purpose. The snapshot of the partner's registration at the time the invoice was issued has another. They are different things. They may happen to coincide, but synchronizing them, trying to force consistency, causes the opposite: it makes them inconsistent.

The semantics of the partner data contained in the invoice determine that they are immutable after issuance, so you have to ensure that. Copying the customer's data into the invoice solves this (it is amazing how some systems, even well-known ones, overlook it). Copying product data is not only about immutability. The data can actually take another form: the price may be special for this transaction, the tax may be different for some specific reason, and the description can also be customized (in a public bid the invoice must carry the description from the bid notice, not what the winning bidder has in its product register), becoming another piece of information that merely used the product register as a starting point. The moment you copy the data, it is decoupled from its origin. If the information is different, copying is not denormalizing. Just because you have the number "1" in two different places does not mean they represent the same thing.
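A sketch of what that copy can look like (hypothetical names; it assumes clientes and produtos register tables already exist): the copied columns are facts of the document itself, not redundant duplicates of the registers.

    CREATE TABLE notas_fiscais (
        id               serial PRIMARY KEY,
        cliente_id       integer NOT NULL REFERENCES clientes (id),
        -- snapshot of the partner data as it was at issuance
        cliente_nome     varchar(100) NOT NULL,
        cliente_endereco varchar(200) NOT NULL,
        emitida_em       timestamp NOT NULL DEFAULT now()
    );

    CREATE TABLE itens_nota (
        nota_id        integer NOT NULL REFERENCES notas_fiscais (id),
        produto_id     integer NOT NULL REFERENCES produtos (id),
        -- facts of this transaction, which may differ from the product register
        descricao      varchar(200)  NOT NULL,
        preco_unitario numeric(12,2) NOT NULL,
        quantidade     numeric(12,3) NOT NULL,
        PRIMARY KEY (nota_id, produto_id)
    );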

Even so, I would suggest normalizing this information. To ensure the immutability of the partner data, perhaps changes to the partner register should be made in a multiversion scheme. Then the invoice can refer directly to the version that was active at the time of issuance. Creating multiple versions of the register may look like denormalization. But if you need the information repeated, if that is the semantics of the mechanism, normalizing it away is impossible; any attempt in that direction destroys the mechanism.

This mechanism has advantages for solving other problems too and is useful in several cases. But it also has disadvantages, so another, purer normalization solution for this purpose may be more viable in other cases.

Keep a table with only the partner data that is required when it is used in an invoice. A new version of this data is only recorded if the register has changed when a new invoice is issued for that partner. This avoids keeping copies of data that are not used in any invoice, and copies of data that do not change for years. This model seems so obvious to me that it is embarrassing that systems do not use some variation of it.
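A sketch of that variation (again with hypothetical names): the invoice points to a row of partner data that is only versioned when something actually changed.

    -- Only the partner fields an invoice needs, versioned on change
    CREATE TABLE dados_fiscais_parceiro (
        id          serial PRIMARY KEY,
        parceiro_id integer NOT NULL,
        nome        varchar(100) NOT NULL,
        endereco    varchar(200) NOT NULL,
        valido_de   timestamp NOT NULL DEFAULT now()
    );

    -- Alternative modeling of the invoice from the previous sketch
    CREATE TABLE notas_fiscais (
        id               serial PRIMARY KEY,
        dados_fiscais_id integer NOT NULL REFERENCES dados_fiscais_parceiro (id),
        emitida_em       timestamp NOT NULL DEFAULT now()
    );

    -- When issuing an invoice: reuse the latest version if nothing changed,
    -- otherwise insert a new version and reference that one.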

I have gone into too much detail here. What matters is that the ideal is to change the normalization, not to denormalize, not least because the system's requirement involves different information.

Calculated fields

Whether you are using accumulator fields or other calculated fields, you may or may not be creating new information. There are cases where the field is really redundant, normalization applies, and consequently a denormalization can be useful to improve performance (see the first section). In extreme cases there are calculations that can take hours.

But in many cases the field simply has a purpose of its own. Example: leaving a monetary value to be converted to another currency only when it is needed is a normalization that can cause problems in accounting. This is a calculated field that generates new information; it cannot be normalized, therefore it cannot be denormalized.
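A sketch of that accounting example (hypothetical names): the converted value and the rate used are captured at transaction time, precisely because recomputing them later would produce different, wrong information.

    CREATE TABLE lancamentos (
        id               serial        PRIMARY KEY,
        valor_original   numeric(14,2) NOT NULL,  -- amount in the original currency
        moeda_original   char(3)       NOT NULL,  -- e.g. 'USD'
        taxa_cambio      numeric(14,6) NOT NULL,  -- rate in effect at the transaction
        valor_convertido numeric(14,2) NOT NULL,  -- amount fixed in the booking currency
        lancado_em       timestamp     NOT NULL DEFAULT now()
    );

    -- The "calculated" column is filled at insert time and never recomputed
    INSERT INTO lancamentos (valor_original, moeda_original, taxa_cambio, valor_convertido)
    VALUES (100.00, 'USD', 5.123400, 512.34);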

We often think about denormalizing something that was never normalized. That makes no sense.

Gambiarra

Yeah, let's be frank, sometimes we need to resort to gambiarras (quick hacks) :) Not desirable, but necessary. This too is a reason to denormalize.

Conclusion

I did not include more specific examples because I think it would turn into a list and become too broad.

Do you want a good reason not to denormalize? It is difficult to define exactly how to do it, as opposed to normalization, which is a relatively simple and well-known concept. Without a good command of SQL and the relational model, measurement tools, and the ability to interpret their results, it is highly likely that denormalization will make the situation worse.

On the other hand, clinging too much to theory on either side will bring more problems than solutions. I could say that doing X violates normal form Y and therefore must not be done. But it might bring advantages without causing any problem in a specific case.

I know programmers and DBAs who will say the opposite of all this. I myself do not always follow what I have said here.

It was good to revisit the subject, because I discovered things I did not know. I just did not post them because they were outside the scope.

3

There is. It is at a very specific moment: the creation of data marts for BI. It is what they call the star schema versus the snowflake schema.

I will argue for the star schema, because the snowflake is wrong in the context of BI data marts. We denormalize into a star schema, and we do this denormalization consciously because there are techniques for it, so we can save the joins and thus gain speed, purely to gain speed and save CPU time when processing large amounts of data for the famous BI reports.

This denormalization is well worth it for the context and uses I mentioned above.
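A minimal sketch of the two shapes (hypothetical names): in the snowflake the dimension is still normalized into sub-tables; in the star it is flattened so the fact table needs a single join per dimension.

    -- Snowflake: the customer dimension still references a normalized city table
    CREATE TABLE dim_cidade (
        id   serial PRIMARY KEY,
        nome varchar(100),
        uf   char(2)
    );
    CREATE TABLE dim_cliente_sf (
        id        serial PRIMARY KEY,
        nome      varchar(100),
        cidade_id integer REFERENCES dim_cidade (id)
    );

    -- Star: the dimension is flattened; city and state repeat per customer
    CREATE TABLE dim_cliente (
        id     serial PRIMARY KEY,
        nome   varchar(100),
        cidade varchar(100),
        uf     char(2)
    );

    CREATE TABLE fato_vendas (
        cliente_id integer NOT NULL REFERENCES dim_cliente (id),
        data_id    integer NOT NULL,
        valor      numeric(12,2) NOT NULL
    );

    -- Typical BI query: one join per dimension, no chain of joins
    SELECT d.uf, sum(f.valor) AS total
    FROM fato_vendas f
    JOIN dim_cliente d ON d.id = f.cliente_id
    GROUP BY d.uf;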

  • A more in-depth explanation of some BI concepts is needed. Although snowflake modeling is more often used in proof-of-concept designs, applying it in some dimensions of data marts is feasible, given the smaller context where it is applied (a single business activity); but when the data is consolidated into the central data warehouse, it is worthwhile converting the modeling of that dimension to a star schema.

  • I will modify the answer when I have a little more time; these days have been very busy for me. But I think there has been a mistake on your part, because the data mart sits above the data warehouse and not before it, so the data there is normalized. In the data warehouse we want unique and temporal data, that is, data over time, 5 or 10 years. The data mart comes after and can be denormalized for better reporting performance over data that is not current, real-time data. That is my point of view, @gmsantos.

  • Flávio, it depends on the DW architecture. You mentioned Inmon's approach, where the data mart is the access interface between the user and the DW. In my comment I based myself on Kimball's approach, where the DW is built from a set of data marts. See Kimball vs Inmon and Architectures for Datawarehouse.

  • Yeah, we agree that we are talking about different things. What Kimball calls a data mart, Inmon calls ETL; in that layer we agree that there is sometimes denormalized data, because it is part of the normalization process for the DW. I think I see your point.

0

Another example, one that people may even disagree with as a model, is a commercial ERP that denormalizes tables because the search engine in its screens does not allow searching columns of secondary tables. For example, in the receivables (titles) tables the client's name is copied into the table, when under normalization the client table's code would suffice. To allow searching the titles by name, the column is copied.
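A sketch of the kind of copy described, with hypothetical table and column names (in a normalized design only cliente_id would exist):

    CREATE TABLE clientes (
        id   integer PRIMARY KEY,
        nome varchar(100) NOT NULL
    );

    CREATE TABLE titulos (
        id           integer PRIMARY KEY,
        cliente_id   integer NOT NULL REFERENCES clientes (id),
        cliente_nome varchar(100) NOT NULL,  -- copied so the screen can search it directly
        valor        numeric(12,2) NOT NULL
    );

    -- The copy is made when the title is issued
    INSERT INTO titulos (id, cliente_id, cliente_nome, valor)
    SELECT 1, c.id, c.nome, 150.00
    FROM clientes c
    WHERE c.id = 42;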

  • 2

    In many cases you MUST copy data into the secondary table, because it is necessary to record the fact. A later change to a name in a register may under no circumstances reflect retroactively on an already issued document, such as an invoice (NF) or a bank slip; likewise, a product already sold whose name is updated in the DB after the sale must have its original name preserved in the corresponding document. In these cases it is very common to copy data, unless you use a DB with temporal query capability (which is a rare thing to see in use).

  • 1

  • PS: I just mentioned it as a counterpoint. I agree that there are a lot of systems on the market (including those of a well-known national company) that need to do what you said; it is how they do it that is already a widespread problem :D

  • @Bacco I understand the point, but I do not know if it justifies denormalization; the issue could be solved with versioning (temporal data), as you said. I confess I never thought about the possibility of a supplier changing its name; wouldn't there be another CNPJ, and then another record would have to be created?! An interesting point!

  • 1

    People, businesses, cities, products, or anything that has a name may have that name changed without it characterizing a change of entity. Would the company also have to change its CNPJ if it changed its address? In theory, the only change that would create a new entity would be the CNPJ itself, and I wouldn't count on that: a government can change the CNPJ system, probably forcing a re-registration. Okay, then you're stuck (at least there is a viable solution). That is why you need to be careful with natural primary keys. A temporal DB is not the solution for everything. If it were, it would already have taken over the world.
