First, let’s understand that denormalizing is not the same thing as not normalizing. Denormalization is a conscious process with a clear goal. A well-normalized database is usually faster than a non-normalized one. Each normal form is stricter than the previous one. Is going from one normal form back down to another denormalization? Or is it just reducing the degree of normalization? I don’t know.
One of the biggest advantages of normalization is that it helps preserve data integrity by eliminating redundancy. Denormalization needs to be done very carefully.
Normalization applies to the relational model. I will not go into cases where this model is not a good fit; in those cases it is not really a matter of denormalizing.
I see only one really compelling reason to denormalize a database.
Performance
Once you have measured that the physical design is hurting the performance, and consequently the scalability, of the database, it can be worthwhile to denormalize it judiciously. That does not mean the logical design should be denormalized.
OLAP vs. OLTP
The most common case is to denormalize parts of the data, parts of the database, or even dedicated databases used for OLAP: operations that are essentially about querying, analysis, simulation, and projection, in applications such as BI, data warehousing, and so on.
Simplified, local databases running on low-capacity, low-performance equipment to perform specific satellite tasks (a point-of-sale (PDV) terminal, for example) may also benefit from the performance gain.
Note that in both cases the denormalization tends to end up in an auxiliary database, and when it happens in the same database, it is done in parts that are inherently used for queries. For typical OLTP activities, denormalization is usually not such an interesting option.
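To make the idea concrete, here is a minimal sketch of what such an auxiliary, query-oriented copy could look like. The table and column names (customer, sale, sale_report) are hypothetical, and SQLite stands in for whatever engine you actually use; the point is only that the flattened table lives alongside, not instead of, the normalized OLTP tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized OLTP side (hypothetical customer/sale tables).
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE sale     (id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(id),
                       sold_at TEXT, total NUMERIC);

-- Denormalized copy used only for analysis/reporting (OLAP-style):
-- customer attributes are repeated on every row to avoid joins.
CREATE TABLE sale_report AS
SELECT s.id AS sale_id, s.sold_at, s.total, c.name AS customer_name, c.city
FROM sale s JOIN customer c ON c.id = s.customer_id;
""")
```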
But is copying data for analysis really denormalization? Does it affect data integrity?
It cannot.
Does it create redundant data?
I have my doubts.
I found no reliable canonical references to settle this. But something tells me that normalization was never intended to cover OLAP activities; when it was created, only OLTP was in mind. My experience (yours may vary) is that denormalization in OLTP usually hurts performance more than it helps.
Performance improvement techniques
In fact there are not that many cases that actually benefit from denormalization. It is that old story of fashion, of half-understood ideas, or of applying something to the wrong context. Most of the time there are other techniques to lessen the impact on the database before you have to resort to denormalization. There are cases where denormalizing solves one problem and creates another.
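As one illustration of "other techniques first", an index on the column being filtered is often all a slow query needs, with none of the integrity risks of duplicating data. A minimal sketch, with hypothetical names and SQLite standing in for the real engine:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (id INTEGER PRIMARY KEY, customer_id INTEGER, "
            "sold_at TEXT, total NUMERIC)")

# Indexing the lookup column attacks the performance problem directly,
# without touching the logical design.
con.execute("CREATE INDEX idx_sale_customer ON sale (customer_id)")

# The query planner now searches the index instead of scanning the table.
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sale WHERE customer_id = 42").fetchall())
```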
There are formal denormalization techniques for this. An example can be found here on SO.
Simplification
Another reason, not a very good one but which can be valid and acceptable, is simplifying the database.
There is a difference between writing the system for Manuel’s bakery and a large system for a large company. Allowing a list of phone numbers in one record of a small system may not be a problem. But managing all the contacts of a complex company leaves no room for registering data any old way. Imagine a partner with several contacts whose phone number changes. You want to change that number, probably preserving the internal extensions of all contacts. Will you change them one by one? And what if you need to search your contacts’ phone numbers? It is better to keep this data in a separate table.
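A minimal sketch of the normalized arrangement described above, with hypothetical table and column names. Because each phone is its own row, changing a shared number or searching by number is a single, straightforward statement:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized version: each phone is a row, so changing a partner's main
# number (or searching by number) does not require parsing a text list.
con.executescript("""
CREATE TABLE contact (
    id         INTEGER PRIMARY KEY,
    partner_id INTEGER NOT NULL,
    name       TEXT NOT NULL
);
CREATE TABLE contact_phone (
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    number     TEXT NOT NULL,
    extension  TEXT,               -- internal extension kept separately
    PRIMARY KEY (contact_id, number)
);
""")

# Updating the company's main number preserves each contact's extension.
con.execute("UPDATE contact_phone SET number = ? WHERE number = ?",
            ("+55 11 4004-0000", "+55 11 3003-0000"))
```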
Some simplifications bring more problems than solutions. Ensuring the integrity of denormalized data can be complicated and risky; that is not real simplification.
Convenience
Some say denormalizing makes reporting easier. But I think this falls under the OLAP issue. Do you want the data in an easier shape for the report generator? Fine, generate an appropriate set of tables with the necessary data, possibly temporary ones, and make them available to the report. If all you want is convenient data access, create views or use another technique that makes things easier for you. Denormalization should not be used for mere convenience.
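For the "convenience only" case, a view can hand the report a flat shape while the stored data stays normalized. A sketch with hypothetical names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(id),
                       total NUMERIC);

-- A view gives the report a flat, convenient shape without storing
-- any redundant data, so the base tables stay normalized.
CREATE VIEW order_report AS
SELECT o.id AS order_id, c.name AS customer_name, c.city, o.total
FROM orders o JOIN customer c ON c.id = o.customer_id;
""")

print(con.execute("SELECT * FROM order_report").fetchall())  # [] until data arrives
```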
Taking your example
Imagine that New York decides to rename itself Old York. Can you fix that? I don’t know. How many different "New York" values exist in the register? "New York" is not the identity of the information, it is a state of it. Let’s say there is a way to fix it. Wasn’t the point to simplify? Or was it just XGH?
If you really need to simplify, violating the first normal form will cause fewer difficulties. You can use PostgreSQL’s array type or any technique that simulates this feature in other database systems. It is only complicated if the database uses fixed row sizes; there can be a lot of waste. But I wonder whether it is real simplification. It takes work to deal with the exceptions.
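A sketch of what deliberately violating 1NF can look like when the engine has no native array type; here the list is simulated as a JSON string in SQLite, and all names are hypothetical. Note how the cost shows up immediately: the application has to unpack the list itself.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Violating 1NF on purpose: the whole phone list lives in one column.
# PostgreSQL would offer a native array type; here a JSON string simulates it.
con.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, phones TEXT)")
con.execute("INSERT INTO contact VALUES (1, 'Manuel', ?)",
            (json.dumps(["+55 11 3003-0000", "+55 11 3003-0001"]),))

# The cost of the shortcut: searching or updating a single number means
# unpacking the list in application code.
for _id, name, phones in con.execute("SELECT id, name, phones FROM contact"):
    print(name, json.loads(phones))
```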
But above all, if you really know what you are doing, anything is valid. The problem is that if you don’t know, you cannot judge.
What seems like denormalization but is not
There are cases where you might think you are denormalizing but are only adapting the design to the domain.
A typical example.
An invoice usually has a reference to the business partner, and the invoice items usually have references to the related products. That is normalized, and I doubt anyone thinks it is a good idea to denormalize the separation between the invoice header and its items.
If a partner’s address, a customer’s for example, changes after the invoice is issued, what happens to the address on the invoice? And what if the price of a product sold on this invoice changes?
You have a beautiful inconsistency and are doing something illegal (at least in Brazil). Bringing customer or product data into the invoice is not denormalization; it is ensuring the correct archiving of the information. The data in the partner’s general register has one purpose; the partner’s registration details at the time the invoice was issued have another. These are different things. They may happen to coincide, but synchronizing them, in an attempt to get consistency, causes the opposite: it makes them inconsistent.
The semantics of the partner data contained in the invoice determine that it is immutable after issuance, so you have to guarantee that. Copying the customer’s data into the invoice solves this (it is amazing how some systems, even well-known ones, overlook it). Copying product data is not only about immutability. In fact, the data may take another form: the price may be special for this transaction, the tax may be different for some specific reason, and the description can also be customized (in a public bid, the invoice must carry the description from the bidding notice and not whatever the winning supplier has in its product register), becoming a new piece of information that merely used the product register as a starting point. The moment you copy the data, it is decoupled from its origin. If the information is different, copying it is not denormalizing. Just because you have the number "1" in two different places does not mean they represent the same thing.
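A sketch of what that copy-at-issuance looks like in a schema, with hypothetical names. The invoice item records the description and price that were actually used, and keeps product_id only as a pointer back to the origin:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The invoice item keeps its own copy of price and description: a snapshot
# of the facts at issuance time, not a redundant duplicate of the register.
con.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, description TEXT, list_price NUMERIC);

CREATE TABLE invoice_item (
    id          INTEGER PRIMARY KEY,
    invoice_id  INTEGER NOT NULL,
    product_id  INTEGER REFERENCES product(id),  -- origin, kept for traceability
    description TEXT NOT NULL,     -- as printed on the invoice (may be customized)
    unit_price  NUMERIC NOT NULL,  -- price agreed for this specific transaction
    quantity    NUMERIC NOT NULL
);
""")

con.execute("INSERT INTO product VALUES (1, 'Notebook 14 inch', 3500)")
# The description and price on the invoice differ from the product register
# on purpose; they are a different piece of information.
con.execute("INSERT INTO invoice_item (invoice_id, product_id, description, "
            "unit_price, quantity) VALUES (1, 1, 'Notebook as per bidding notice', 3200, 10)")
```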
Even so, I would suggest normalizing this information. To guarantee the immutability of the partner data, perhaps changes to the partner register should be multiversioned. Then the invoice can refer directly to the version that was active at the time of issuance. Creating multiple versions of the register may look like denormalization, but if you need the information repeated, if that is the semantics of the mechanism, normalizing it away is impossible; any attempt in that direction destroys the mechanism.
This mechanism has advantages, even for solving other problems, and is useful in several cases. But it also has disadvantages, so another, purer normalization solution for this purpose may be more viable in other cases.
Keep a table containing only the partner data that is actually used on an invoice. A new version of this data is recorded only if the register has changed when a new invoice is issued to that partner. This avoids both keeping copies of data that is never used on any invoice and keeping copies of data that does not change for years. This model seems so obvious to me that it is embarrassing that systems do not use some variation of it.
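A sketch of that variation, with hypothetical names: partner data is versioned lazily, a new row only when the data used on an invoice has actually changed, and each invoice points at the exact version that applied to it.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Versioned snapshot of the partner data used on invoices.
con.executescript("""
CREATE TABLE partner_invoice_data (
    id         INTEGER PRIMARY KEY,
    partner_id INTEGER NOT NULL,
    name       TEXT NOT NULL,
    address    TEXT NOT NULL
);
CREATE TABLE invoice (
    id                      INTEGER PRIMARY KEY,
    partner_invoice_data_id INTEGER NOT NULL REFERENCES partner_invoice_data(id),
    issued_at               TEXT NOT NULL
);
""")

def data_version_for(partner_id, name, address):
    """Reuse the latest version if nothing changed, otherwise insert a new one."""
    row = con.execute(
        "SELECT id FROM partner_invoice_data "
        "WHERE partner_id = ? AND name = ? AND address = ? "
        "ORDER BY id DESC LIMIT 1", (partner_id, name, address)).fetchone()
    if row:
        return row[0]
    cur = con.execute(
        "INSERT INTO partner_invoice_data (partner_id, name, address) "
        "VALUES (?, ?, ?)", (partner_id, name, address))
    return cur.lastrowid

v1 = data_version_for(1, "Manuel Ltda", "Rua A, 10")
v2 = data_version_for(1, "Manuel Ltda", "Rua A, 10")   # reused: nothing changed
v3 = data_version_for(1, "Manuel Ltda", "Rua B, 99")   # new version: address changed
print(v1, v2, v3)  # 1 1 2
```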
I have gone into too much detail about this. What matters is that the ideal is to change the normalization, not to denormalize, especially since the system’s requirements call for a different piece of information.
Calculated fields
Whether you are using accumulator fields or other calculated fields, you may or may not be creating new information. There are cases where the value really is redundant, normalization applies, and consequently denormalization can be useful to improve performance (see the first section). In extreme cases there are calculations that can take hours.
But in many cases the field simply meets a need of its own. Example: leaving a monetary value to be converted to another currency only when you need it is a normalization that can cause problems in accounting. This is a calculated field that generates new information; it cannot be normalized, therefore it cannot be denormalized.
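A tiny sketch of the currency example, with hypothetical names and rates: the converted amount is computed once with the rate valid at transaction time and stored, because recomputing it later with a newer rate would change the books. That stored value is new information, not a redundant copy.

```python
from decimal import Decimal

def record_sale(amount_usd: Decimal, usd_to_brl_rate: Decimal) -> dict:
    # The BRL amount is fixed at transaction time, together with the rate used.
    return {
        "amount_usd": amount_usd,
        "rate_used": usd_to_brl_rate,
        "amount_brl": (amount_usd * usd_to_brl_rate).quantize(Decimal("0.01")),
    }

sale = record_sale(Decimal("100.00"), Decimal("5.1234"))
print(sale["amount_brl"])  # 512.34, unchanged regardless of future exchange rates
```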
We often talk about denormalizing something that was never normalized in the first place. That makes no sense.
Gambiarra (kludge)
Yeah, let’s be frank, sometimes we need to resort to gambiarras (kludges) :) Not desirable, but necessary. This is a reason to denormalize.
Conclusion
I did not include more specific examples because I think it would turn into a list and become too broad.
Do you want a good reason not to denormalize? It is hard to define exactly how to do it, as opposed to normalization, which is a relatively simple and well-known concept. Without a good command of SQL and the relational model, of measurement tools, and of the ability to interpret the results, it is very likely that denormalization will make things worse.
On the other hand, clinging too hard to theory, on either side, will bring more problems than solutions. I could say that doing X violates normal form Y and therefore is not allowed. But it might bring advantages without causing any problem in the specific case.
I know programmers and DBAs who will say the opposite of all of this. I myself do not always follow what I have said here.
It was good to revisit the subject, because I discovered things I did not know. I just did not post them because they were outside the scope.
It would be nice for you to explain in the question how you define a normalized database and a denormalized one. It would help future visitors to learn, and whoever answers would not have so much room to drift off focus. As it is, it may be a bit broad and leave room for interpretation.
– Bacco
I followed your suggestion, @Bacco, and added an example. =)
– user7261
I find the second one interesting when the data repeats rarely or never, as with a home phone number or an address (there is some chance of repeated data for the same person, but too small to justify a separate table), and sometimes for scalability reasons.
– Leonardo Bosquett
A possible scenario that would make the harm of the denormalized database even more evident is localization. In the second model you would need to rely on an external resource, like Gettext, whereas in the first it would be enough, for example, to add an extra column serving as a secondary ID, perhaps relating to a third table of languages.
– Bruno Augusto
The curious thing is that the second example does not demonstrate denormalization: it does not violate any of the normal forms. Namely: a person does not have and does not belong to N cities, so 1NF has not been violated; the table has no composite natural primary key, so it is impossible to violate 2NF; city is indeed an attribute that describes the primary key (it is an attribute of the person and not of another field), so 3NF was not violated either; 4NF and 5NF (more complex) were not violated either. The second example is just repeated text taking up space that could be saved.
– Caffé