What is database normalization?

I was talking to some programmer friends when I heard the term "database normalization". Although I had heard the term a few times before, I never understood what it was about.

A few questions came to mind:

  • What does that term mean?

  • Can any type of database (SQL or NoSQL) use this normalization?

Another thing: while searching the Internet, I found both the terms "data normalization" and "database normalization". Is there any difference between the two? Is one of them the correct term, or can both be used?

I don't want to be embarrassed by hearing a term like that in a meeting and not understanding it :)

  • Related: http://answall.com/q/22069/101. In a certain way, it is like applying DRY to the database: http://answall.com/q/120931/101

  • This is a community where questions help not only me (the person who wanted them answered) but other people as well. So I suggest that those who downvoted point out what is wrong with the question, so that I can fix it and help anyone interested in this question in the future.

4 answers

86


What is normalization?

Database normalization is the process of transforming the structure of a database in order to eliminate redundancies and to eliminate insertion, update and deletion anomalies.

When the normalization process is carried out, the data in the database becomes better organized and will usually occupy less physical space. However, normalization also increases the number of tables and can, in many cases, be a difficult task to perform. In addition, a database normalized beyond what is necessary may suffer from poor performance and/or excessive complexity.

Anomalies

The main purpose of the normalization process is to eliminate insertion, update, and deletion anomalies. An anomaly occurs when there is no way to record a piece of information without some other, unrelated piece of information also being recorded. For example, imagine you have a table funcionário with the columns codigo, nome and projeto, where the column projeto holds the name of the project to which the employee is allocated. And then you have the following data:

codigo | nome   | projeto
1      | Peter  | Sales
2      | Maria  | Sales
3      | Carlos | Customer registration

Then a new project comes along: invoice issuance. How do you register this new project? The answer is that you cannot, because to do so you would need some employee already allocated to it, i.e., we have an insertion anomaly.

If, in the previous example, the employee Carlos left the company and was removed from the table, the information about the customer registration project would be lost. This unwanted side effect is the deletion anomaly. If, instead, he were merely moved to the new invoice issuance project, we would likewise lose the record that the customer registration project exists: that is the update anomaly.

The problem causing these anomalies is that the project information sits entirely inside the employee table, which is not its place. If we have two related tables (1-to-N), employees and projects, the anomalies disappear.

Anomalies also relate to the concept of referential integrity.
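As a minimal sketch of that two-table fix (the SQL below is illustrative; the table and column names are assumptions, not from the original question), a foreign key relates employees to projects and enforces the referential integrity just mentioned:

    -- Hypothetical decomposition: projects get their own table.
    CREATE TABLE projeto (
        codigo INT PRIMARY KEY,
        nome   VARCHAR(100) NOT NULL
    );

    CREATE TABLE funcionario (
        codigo     INT PRIMARY KEY,
        nome       VARCHAR(100) NOT NULL,
        projeto_id INT,  -- NULL while the employee is not allocated
        FOREIGN KEY (projeto_id) REFERENCES projeto (codigo)
    );

    -- A project can now exist before any employee joins it,
    -- so the insertion anomaly is gone:
    INSERT INTO projeto (codigo, nome) VALUES (4, 'Invoice issuance');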

First normal form (1NF)

The first rule to eliminate anomalies is:

There must be no multivalued columns.

What is a multivalued column? It is a column in which it is possible to store more than one value per record. For example, imagine that your client table has the columns codigo, nome and telefones, filled in like this:

codigo | nome    | telefones
1      | Paul    | 99567-4289, 3605-4900
2      | Maria   | 97223-0444
3      | Alan    | -
4      | Juliana | 2853-0266, 96610-5480, 2410-9941

Note that the telefones column is multivalued. Inserting, updating and deleting phone numbers in this scheme is tricky, and so is searching for a specific number. The solution is to split it into two tables:

Client:

codigo | nome
1      | Paul
2      | Maria
3      | Alan
4      | Juliana

Telephone:

codigo | codigo_cliente | numero
1      | 1              | 99567-4289
2      | 1              | 3605-4900
3      | 2              | 97223-0444
4      | 4              | 2853-0266
5      | 4              | 96610-5480
6      | 4              | 2410-9941

By eliminating all multivalued columns, the database reaches a structural form called first normal form, or simply 1NF.
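To see the practical gain, here is a hedged sketch of a phone lookup, assuming the two tables above are named cliente and telefone with the columns shown:

    -- With the multivalued column, a search needed fragile pattern
    -- matching such as: WHERE telefones LIKE '%3605-4900%'
    -- After the 1NF split it is an exact, indexable lookup:
    SELECT c.codigo, c.nome
    FROM cliente AS c
    JOIN telefone AS t ON t.codigo_cliente = c.codigo
    WHERE t.numero = '3605-4900';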

Candidate keys and superkeys

Before proceeding to the other normal forms, it is necessary to introduce the concepts of candidate keys and superkeys.

The primary key (PK) is a set of columns that identifies a tuple uniquely (it can be a single column, or two or more). It is important that the database designer knows how to identify which columns are most appropriate to be part of the primary key. Often a column with a sequential numeric code will serve as the primary key, but there are several situations where it will not.

Sometimes there is more than one set of columns that could be the primary key. Each of these sets is called a candidate key. For example, in a table Pessoa with the columns CPF, RG, Estado and Nome, both the CPF alone and the RG together with the Estado are candidate keys. Thus it is possible to reach the Nome from the CPF, but it is also possible to reach the Nome from the RG and the Estado.

Any set of columns that contains a candidate key as a subset is called a superkey.

Second normal form (2NF)

The second normal form is the one that says:

Every column must be functionally dependent on the totality of each candidate key.

In most cases, "each candidate key" can be read as "the primary key", unless there is more than one candidate key.

Also, for a table to be in the second normal form, it must be in the first normal form.

A column is functionally dependent on the primary key when, in the application domain, its value is determined by the primary key. A column is not functionally dependent on the primary key when it is defined independently of it, or when it is determined by some other column that is not the primary key.

A functional dependency is on the totality of the primary key when all the columns of the primary key are needed to establish the dependency. When the primary key is composite, a partial dependency becomes possible.

To give an example of 2NF, imagine that your company has sales representatives serving customers and wants to record which representatives serve which customers. In this table (let's call it representação), the columns nome_cli (customer name) and nome_repr (representative name) together form the primary key, and we also have end_repr for the representative's address, end_cli for the customer's address, and valor_contrato. That is:

Representação:

nome_cli (PK) | nome_repr (PK) | end_repr      | end_cli       | valor_contrato
Louis         | Cleyton        | 888 Street    | X Street, 123 | R$ 500,00
Paula         | Lara           | B Street, 999 | Y Street, 345 | R$ 900,00
Paula         | Cleyton        | 888 Street    | Y Street, 345 | R$ 650,00

Note that the columns end_repr and end_cli depend on the primary key, but they do not depend on all of it: each depends on only part of the key. The solution here is to have a table of customers (with the customer's address), a table of sales representatives (with their addresses), and to leave in the relationship table only the respective foreign keys along with the contract value. That is:

Customers:

nome_cli (PK) | end_cli
Louis         | X Street, 123
Paula         | Y Street, 345

Representatives:

nome_repr (PK) | end_repr
Cleyton        | 888 Street
Lara           | B Street, 999

Representação:

nome_cli (PK) | nome_repr (PK) | valor_contrato
Louis         | Cleyton        | R$ 500,00
Paula         | Lara           | R$ 900,00
Paula         | Cleyton        | R$ 650,00
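A minimal SQL sketch of this 2NF decomposition, assuming the illustrative names from the example (a real design would likely use surrogate keys rather than names as primary keys):

    CREATE TABLE cliente (
        nome_cli VARCHAR(100) PRIMARY KEY,
        end_cli  VARCHAR(200) NOT NULL
    );

    CREATE TABLE representante (
        nome_repr VARCHAR(100) PRIMARY KEY,
        end_repr  VARCHAR(200) NOT NULL
    );

    -- Only the contract value stays with the composite key, since it is
    -- the one column that depends on the whole key:
    CREATE TABLE representacao (
        nome_cli       VARCHAR(100) REFERENCES cliente (nome_cli),
        nome_repr      VARCHAR(100) REFERENCES representante (nome_repr),
        valor_contrato DECIMAL(10, 2) NOT NULL,
        PRIMARY KEY (nome_cli, nome_repr)
    );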

Third normal form (3NF)

The third normal form is the one that says:

Every column must be functionally dependent on the totality of each candidate key, and on nothing besides those candidate keys.

Again, in most cases "each candidate key" can be read as "the primary key", unless there is more than one candidate key. If the only candidate key is the primary key, the rule reads:

Every column must be functionally dependent on the whole primary key, and on nothing besides the primary key.

Moreover, for a table to be in the third normal form, it must first be in the second normal form (and therefore also in the first). Depending on the totality of each candidate key is handled by the second normal form, so the focus here is on depending on nothing besides those keys.

For example, imagine a car table with the columns placa (primary key), cor, nome_proprietário and endereço_proprietário:

placa (PK) | cor   | nome_proprietário | endereço_proprietário
ABX-1234   | blue  | José              | X Street, 123
NNU-5566   | green | Marcos            | Example Street, 5678
SGH-7210   | black | Maria             | Test Avenue, 7743
ERT-6902   | red   | José              | X Street, 123
BGH-5431   | black | José              | X Street, 123

Look at the owner's address: it is a violation of the third normal form. The owner's address is determined by who the owner is, not by the car's license plate. If José changes address and we update the address on only one of his cars, the database becomes inconsistent (an update anomaly). If Maria buys another car and we insert it with a different address, it will also be inconsistent (an insertion anomaly). If Marcos sells his car, his address is forgotten (a deletion anomaly). The solution, again, is to separate it into two tables:

Car:

placa (PK) | cor   | proprietário
ABX-1234   | blue  | 1
NNU-5566   | green | 2
SGH-7210   | black | 3
ERT-6902   | red   | 1
BGH-5431   | black | 1

Owner:

codigo (PK) | nome   | endereço
1           | José   | X Street, 123
2           | Marcos | Example Street, 5678
3           | Maria  | Test Avenue, 7743
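With this split, a change of address touches exactly one row; a hedged sketch, assuming the owner table above is named proprietario:

    -- José moves: one UPDATE fixes the address for all three of his
    -- cars at once, leaving no room for the update anomaly.
    UPDATE proprietario
    SET endereco = 'New Street, 456'  -- hypothetical new address
    WHERE codigo = 1;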

Boyce-Codd normal form (BCNF)

There is a normal form that is slightly stronger than the third normal form but is not required in order to reach the fourth (or even the sixth). This is the Boyce-Codd normal form (BCNF), also sometimes called 3.5NF or the Boyce-Codd-Heath normal form.

It is very rare to find tables that are in the third normal form but not in Boyce-Codd normal form. In addition, there are (unusual) cases where this normal form is impossible to achieve, unlike the forms from the first to the sixth, which can always be achieved.

The difference appears when there is more than one candidate key and they overlap. The idea is that the columns belonging to candidate keys should always determine the other columns, and never be determined by them. Therefore, in Boyce-Codd normal form it is not allowed to arrive at a candidate key from some other candidate key through functional dependencies.

Taking an example borrowed from an answer on Stack Overflow, let's say that each pizza can have several different toppings, each of a different type, and that you have two pizzas with these toppings:

Pizza | Topping    | Topping type
1     | mozzarella | cheese
1     | pepperoni  | meat
1     | olives     | vegetable
2     | mozzarella | meat
2     | sausage    | cheese
2     | pepper     | vegetable

There are two candidate keys there: the pizza together with the topping type determines the topping used; and the pizza together with the topping determines the topping type.

Note that, for whichever candidate key we look at, the remaining column depends on the whole candidate key and on nothing besides it, so the third normal form has been reached.

Still, there is something wrong there, because sausage is not a cheese, and mozzarella cannot be meat and cheese at the same time: although we have reached the third normal form, we still have anomalies.

The solution is, again, to split into two tables:

Pizza (PK) | Topping (PK)
1          | mozzarella
1          | pepperoni
1          | olives
2          | mozzarella
2          | sausage
2          | pepper

Topping (PK) | Topping type
mozzarella   | cheese
pepperoni    | meat
olives       | vegetable
sausage      | meat
pepper       | vegetable
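A hedged SQL sketch of this BCNF decomposition (the names are illustrative); the foreign key now makes it impossible to record the same topping with two different types:

    CREATE TABLE cobertura (
        nome VARCHAR(50) PRIMARY KEY,
        tipo VARCHAR(50) NOT NULL  -- each topping has exactly one type
    );

    CREATE TABLE pizza_cobertura (
        pizza     INT,
        cobertura VARCHAR(50) REFERENCES cobertura (nome),
        PRIMARY KEY (pizza, cobertura)
    );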

Fourth normal form (4NF)

It is rare to find tables that are in the third normal form but not in the fourth. The fourth normal form concerns anomalies in the relationships between different columns of the primary key, and it only applies to tables whose primary key is composed of three or more columns.

Tables representing ternary, quaternary or n-ary many-to-many relationships are the places where it is worth looking for possible 4NF violations.

For a table to be in the fourth normal form, it must also be in the third normal form. The Boyce-Codd normal form is not required.

It is somewhat difficult to explain, but imagine the case of sales representatives serving customers, where several representatives may serve several customers, and of service contracts in which several representatives take part. And then we have the following table, where all the columns together form the primary key:

Representative | Contract | Customer
Geraldo        | 1        | clothing store
Geraldo        | 1        | hospital
Geraldo        | 2        | clothing store
Geraldo        | 2        | hospital
Marta          | 1        | toy store
Marta          | 1        | hospital
Marta          | 3        | toy store
Marta          | 3        | hospital
Louis          | 2        | clothing store
Louis          | 2        | toy store
Louis          | 4        | clothing store
Louis          | 4        | toy store

Note that we have many-to-many relationships between representatives and contracts and between representatives and customers, but not between customers and contracts! There lies the violation of the normal form and the possibility of anomalies. The ideal is to separate this into two tables, one with the relationship between representatives and customers and the other with the relationship between representatives and contracts:

Representative | Contract
Geraldo        | 1
Geraldo        | 2
Marta          | 1
Marta          | 3
Louis          | 2
Louis          | 4

Representative | Customer
Geraldo        | clothing store
Geraldo        | hospital
Marta          | toy store
Marta          | hospital
Louis          | clothing store
Louis          | toy store

Note that the original table holds redundant information. For example, the fact that Marta is on contract 1 appears twice. If we deleted the Marta-1-hospital tuple, we would have a deletion anomaly, since Marta still works at the hospital (via contract 3) and is still on contract 1 (at the toy store). We would also have an insertion anomaly if some representative were allocated to a contract for which there is no customer yet.
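Since the two many-to-many relationships are independent, the original ternary table is exactly the join of the two projections; a hedged sketch, with illustrative table names:

    -- Rebuilds every (representative, contract, customer) combination
    -- without storing any of them redundantly:
    SELECT rc.representante, rc.contrato, rcl.cliente
    FROM representante_contrato AS rc
    JOIN representante_cliente  AS rcl
      ON rcl.representante = rc.representante;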

Fifth normal form (5NF)

The fifth normal form is more restrictive than the fourth, and it likewise applies to tables with three or more columns in the primary key. Again, for the fifth normal form to be attained, the fourth normal form must be attained first.

Let's imagine a case similar to the one used for the fourth normal form, but this time, in addition to the relationships between representatives and customers and between representatives and contracts, we also have a relationship between customers and contracts. So we have this table:

Representative | Contract | Customer
Geraldo        | 1        | clothing store
Geraldo        | 1        | hospital
Geraldo        | 2        | clothing store
Marta          | 1        | hospital
Marta          | 3        | toy store
Marta          | 3        | hospital
Louis          | 2        | clothing store
Louis          | 4        | toy store

Note that this table is different from the one used for the fourth normal form. There are still redundancies: for example, it says twice that Geraldo serves the clothing store and twice that contract 1 applies to the hospital. The technique used to reach the fourth normal form does not apply here, because for every pair of columns taken from the key there is a meaningful relationship (i.e., we now have the contract-customer relationship too). Thus this relation is already in the fourth normal form, but there is still room for normalization. When we break it down into three different relations, we arrive at the fifth normal form:

Representative | Contract
Geraldo        | 1
Geraldo        | 2
Marta          | 1
Marta          | 3
Louis          | 2
Louis          | 4

Representative | Customer
Geraldo        | clothing store
Geraldo        | hospital
Marta          | toy store
Marta          | hospital
Louis          | clothing store
Louis          | toy store

Contract | Customer
1        | clothing store
1        | hospital
2        | clothing store
3        | toy store
3        | hospital
4        | toy store
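The defining property of this 5NF decomposition is that only the join of all three projections reproduces the original table; joining just two of them produces spurious tuples. A hedged sketch, with illustrative table names:

    SELECT rc.representante, rc.contrato, cc.cliente
    FROM representante_contrato AS rc
    JOIN contrato_cliente       AS cc
      ON cc.contrato = rc.contrato
    JOIN representante_cliente  AS rcl
      ON rcl.representante = rc.representante
     AND rcl.cliente       = cc.cliente;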

Sixth normal form (6NF)

For a table to reach the sixth normal form, it must first be in the fifth normal form. The sixth dictates that:

Every table may have at most one column that is not part of the primary key.

Finding cases where the fifth normal form has been reached but the sixth has not is easy: any table with two or more columns outside the primary key violates the sixth normal form.

However, finding cases where applying the sixth normal form makes sense and brings some real benefit is quite rare. In most cases applying it is madness, because it tends to produce an explosion in the number of tables while bringing little or no benefit. The sixth normal form was not defined to be useful in practice; its purpose is rather to be a theoretical limit, the final point beyond which no further normalization is possible.

To see what the process looks like (and also to doubt any real benefit it brings), imagine a table pessoa with the columns codigo (primary key), nome (NOT NULL), sexo (NOT NULL) and cor_favorita (NULLABLE), holding these records:

codigo (PK) | nome    | sexo | cor_favorita
1           | Marcela | F    | green
2           | Rodolfo | M    | -
3           | James   | M    | yellow

It would be decomposed like this:

Pessoa-nome:

codigo (PK) | nome
1           | Marcela
2           | Rodolfo
3           | James

Pessoa-sexo:

codigo (PK) | sexo
1           | F
2           | M
3           | M

Pessoa-cor-favorita:

codigo (PK) | cor_favorita
1           | green
3           | yellow

Note that each column that was not part of the primary key ended up in a separate table. Also note that the table pessoa-cor-favorita has no record for element 2 (Rodolfo). This form of normalization eliminates the need for NULLABLE columns: all resulting columns are NOT NULL, and rows that would have held NULL are simply omitted.
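Reassembling the original rows, NULLs included, then requires an outer join; a hedged sketch, assuming the three tables above are named pessoa_nome, pessoa_sexo and pessoa_cor_favorita:

    -- Rodolfo comes back with cor_favorita = NULL:
    SELECT n.codigo, n.nome, s.sexo, c.cor_favorita
    FROM pessoa_nome AS n
    JOIN pessoa_sexo AS s ON s.codigo = n.codigo
    LEFT JOIN pessoa_cor_favorita AS c ON c.codigo = n.codigo;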

Domain-key normal form (DKNF)

The domain-key normal form is the one that dictates that:

All integrity constraints on the database must be enforced either by key constraints or by domain (value) constraints.

Key constraints means primary keys and foreign keys. Domain means the set of valid values for each column.

This normal form is quite strong, stronger than the fifth normal form and the Boyce-Codd normal form combined. It is, however, no stronger than the sixth, since it allows several columns outside the primary key in the same table; nor is the sixth stronger than it, since there is no mechanism in 6NF ensuring that all integrity constraints are modeled as DKNF requires.

This normal form is nirvana, utopia, the ideal and perfect state of normalization. In practice, however, it is almost impossible to achieve, because any database with business rules of some complexity will probably have some consistency rule that cannot be modeled purely as key or domain-value constraints. Even when unattainable, though, an effort to get close to it is worthwhile for eliminating various possibilities of anomalies.
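To make "domain constraint" concrete, here is a hedged sketch (illustrative names) of restricting a column's domain in SQL so that invalid values can never be stored:

    CREATE TABLE pessoa (
        codigo INT PRIMARY KEY,
        nome   VARCHAR(100) NOT NULL,
        -- Domain constraint: only these two values are valid.
        sexo   CHAR(1) NOT NULL CHECK (sexo IN ('F', 'M'))
    );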

Other normal forms

There is a published algorithm (Bernstein's algorithm) for normalizing tables into the third normal form. However, later research found that this algorithm is a little stricter than the third normal form requires, while still not guaranteeing the Boyce-Codd normal form or the fourth normal form. For this reason, the normal form achieved by this algorithm was named the elementary key normal form (EKNF).

Another existing normal form, stronger than the fourth normal form and the Boyce-Codd normal form combined, but weaker than the fifth normal form and the Boyce-Codd normal form combined, is the essential tuple normal form (ETNF). The authors who defined it proposed it as an alternative to the fifth normal form which, although weaker, would be just as effective. ETNF is reached when BCNF holds and at least one candidate key consists of a single column. The same authors also refer to other works defining the superkey normal form (SKNF), the redundancy-free normal form (RFNF) and the key-complete normal form (KCNF). 5NF+BCNF is stronger than SKNF, which is stronger than RFNF, which is stronger than ETNF, which is stronger than 4NF+BCNF. It has also been shown that KCNF is equivalent to RFNF, even though they were defined by different people in different ways.

Ah, and of course, we also have the unnormalized form (UNF), which is any form that does not fit even the first normal form.

How far to normalize?

In general, many designers are satisfied with reaching the third normal form and do not worry much about the normal forms above it, because:

  • They apply to rare cases: when the third normal form is reached, Boyce-Codd, the fourth and the fifth are almost always reached too, by luck or accident. It is hard to have a case that is in 3NF but not in BCNF, and tables with three or more columns in the primary key that have dependency relations between those columns are very rare.

  • Once the third normal form is achieved (and, probably by luck or accident, Boyce-Codd, the fourth and the fifth as well), there is very little room left for anomalies. Seeking the sixth normal form is, in general, madness and brings no practical benefit. Seeking the domain-key normal form is almost always impossible, although the attempt may still reveal some possibility of anomaly that can be eliminated.

  • Sometimes, to improve the performance of the database or to simplify its structure, the recommendation ends up being to denormalize some things. Many systems focused on data warehousing or business intelligence are normalized only up to the second normal form, often being designed from structures that were already in the third normal form and then denormalized.

  • In fact, the sixth normal form is more useful when thinking about denormalization than normalization. The idea would be to look for tables that have 1-to-1 relationships and unify them, thereby performing a denormalization. After all, when the data belonging to a record of a given application-domain concept is spread across several tables with 1-to-1 relationships, that is a hint that maybe they should be a single table.

Where can the concept of normalization be applied?

The concepts of normalization and normal forms apply only to relational databases. Other types of data stores that are not relational (for example, an application that saves data in files organized in folders) may also have analogous concepts of normalization for eliminating redundancies, improving the structure and reducing the possibility of anomalies. In that territory, however, the concepts of normalization are not as well defined as they are for relational databases, and each case has its particularities.

The NoSQL case in particular is very interesting, because its idea is to model data that does not need strong consistency (what is called eventual consistency), and its data often comes from places without a strong, well-defined structure. For this reason, it does not make as much sense to talk about normalization in NoSQL databases as it does in relational SQL ones: the idea of NoSQL is to accept, tolerate and know how to deal with anomalies, and to keep working even when they arise. NoSQL databases sacrifice consistency in exchange for scalability, and for that reason anomalies have to be tolerated. Despite this, there are still some concepts of normalization for NoSQL, but they are not as well defined or as mature. See a little about this here.

Data normalization and database normalization

To close the answer: database normalization is the normalization of the database structure, while data normalization is the normalization of the data already in the tables.

However, especially in the case of restructuring databases that are already in production, the two concepts go so tightly together that it hardly makes sense to talk about one without the other. Hence, in many practical situations, they end up being treated as synonyms.

  • Even with the answer still "incomplete", I already understood the concept better. Thank you so much for spending your precious time. +1

  • @Wallacemaxters I think my answer is ready now. :)

  • Just one question: in the 1NF example, if the table had four telephone columns (phone1, phone2, ...), would that solve the problem? It was not clear to me whether the issue is one column with multiple values, or also multiple "equal" columns. And if it is not a problem for 1NF, would it only become one in 6NF?

  • @Guilhermecostamilam It "solves" it, in quotes. That is a very crude, limited and awkward way of pretending the column is not multivalued; it only serves to circumvent 1NF. The definition of a multivalued column is one column with multiple values, not multiple columns representing multiple values (which I think is even worse than not being in 1NF, or at least just as bad)...

  • @Guilhermecostamilam ... In that case, you could even reach 6NF while keeping this workaround. This shows that normalization alone is not enough to create a good model, because it is perfectly possible to have a horrific, inadequate and yet fully normalized model.

  • @Victorstafusa Great answer, +1, but I have a question, maybe a stupid one, about 2NF. Is this statement correct: "If I have a primary key made of a single column, that is, not composite, the relation is already in 2NF"? I have always seen 2NF presented with a composite primary key, so I deduced the statement is correct, but I cannot find any bibliography stating it. Could you help me?

  • @Luizaugusto The statement is false. Just take something like id_carro as the simple primary key and put placa, cor, id_proprietario and nome_proprietario as the other columns. The column nome_proprietario depends on id_proprietario and not on id_carro, so this is a violation of 2NF.


13

Data normalization is a set of rules applied to relational database tables in order to maintain data consistency and to avoid duplication/redundancy and problems when removing or updating records.

The normal forms are 1FN, 2FN, 3FN, BCNF, 4FN and 5FN. Tables are usually normalized up to the third normal form; the fourth and fifth normal forms deal with more specific problems.

  • What are 1FN, 2FN and 3FN? hehehe

  • @Wallacemaxters 1FN is the 1st Normal Form. It is a sequential process: for example, for a table to be in 3FN it must necessarily be in 2FN, which in turn must obey 1FN.

  • http://www.luis.blog.br/normalizacao-de-dados-e-as-formas-normais.aspx

  • Ah, FN means Normal Form? Now I'm starting to get it.

  • https://en.wikipedia.org/wiki/Sixth_normal_form - https://en.wikipedia.org/wiki/Domain-key_normal_form

  • Simple and easy.


13

The question already has a very good answer, almost a chapter of a book covering all the details, but I missed something important in it, something I notice many people confuse or fail to understand: why normalization is necessary. So here is a "summarized" way to understand the subject without having to go through all the normal forms.

Undue denormalization

Especially in NoSQL times (particularly with the document model), I see normalization being unduly attacked, often by people who do not understand it (I will disregard bad faith, that is, trying to justify one's preferred technology existing and being ubiquitous, which NoSQL is not and should not be).

Nor am I saying that normalization, even where appropriate, should always be applied; just be clear that wherever it is not applied, you take on the burden of dealing with the consequences. A rule broken consciously can be useful. One of the problems of the document model is that it "requires" denormalization in "everything", so breaking the rule becomes the rule, and following it the exception.

Canonicality

Problems that do not call for normalization should not be normalized, but those that do almost always call for it for one basic reason: a piece of information needs to be canonical, that is, it should exist in only one place in your database. This does not mean there can be no repetition; thinking so is a common mistake. I explain this in the question "DRY is to avoid redundancies, right?", a concept misunderstood by most people, who think the problem is repetition. What goes for code goes for data: it is important to have a single authoritative reference for that knowledge.

Every time the same knowledge is replicated elsewhere, your code becomes responsible for administering that replica. Is that what you want for your code, to carry that burden? And what happens if it (you) fails at it? Inconsistency. The same information can be modified in one place and not in the other, leaving it in two different states; depending on which one is accessed, something inconsistent will be shown.

Note that there are cases where a piece of information needs to be frozen at a given moment and becomes a different piece of information. When that occurs, the repetition does not matter; on the contrary, it should repeat, because the repetition is circumstantial: at some point the values will potentially differ, and they should, because they have come to represent different things.

Just because the data is the same in two places does not mean it is the same information, just as two men named José da Silva are not the same person, nor are two identical twin brothers. Repetition is not the reason to normalize; being canonical is.

False premise

A common mistake I see around, probably the fault of teachers and books that taught it wrong, is saying that normalization serves to reduce the space consumed. That is only a pleasant and always desirable side effect; it is not the reason to normalize. The reason is the canonicality of information.

Nor is it correct to say that space is no problem nowadays. Taking up more space makes you access mass storage (HDD or SSD) more often and keep less data in fast storage (RAM), which is detrimental to performance.

Of course, normalizing can also bring some performance difficulties in certain situations, but you have to consider whether a slightly better performance is worth the difficulty of maintaining the data and the risk of inconsistency.

Interestingly, denormalization can help avoid a JOIN on reads, but when used in exaggeration it can create a kind of JOIN on writes, which is much worse, making it easier to produce deadlocks and other harmful phenomena.

Hardware evolution (NVRAM) will make the cost of a JOIN derisory and negligible. Nothing will evolve to deal well with non-canonical information.

Correct normalization

Understand that I am not defending wrong normalization. Today, all mainstream relational databases have ways of using flexible data models that can avoid unnecessary normalization.

It is interesting to take a look at "What is the difference between Association, Aggregation and Composition in OOP?". Generally, cases of composition should not be normalized; in databases this is easier and better to do because of the way data is accessed (unless access uses fixed-size rows, which is rare nowadays). Aggregation is a clear case for normalization. Association would be as well, but there are cases where it can be turned into composition for database purposes, so this should be seen as an optimization (aggregation can be too, but more rarely).

One of the reasons I criticize microservices is that they destroy normalization in the name of an architecture that is almost always unnecessary. Instead of having one system with canonical information, you have several subsystems, each with its own (possibly) canonical information, but the whole violates DRY completely and becomes a nightmare to take care of.

Beware of the rule for the rule's sake; all rules can and should be broken when doing so is more important. Knowing when to break them takes real mastery of what you are doing. Without ever breaking rules you will already be doing it wrong by definition, and will only get it right by coincidence. Always remembering:

[Image: a completely beat-up Fiat 147 still driving through the streets]

Can you tolerate anomalies in your data? NoSQL preaches exactly that.

With these things made clearer, re-read Victor Stafusa's answer, paying more attention to its details. Had you paid close attention to this sentence?

The main purpose of the normalization process is to eliminate insertion, update, and deletion anomalies.

Beyond the normal forms

One point the answer does not touch on is that normalization is not always obvious. The most common example I quote: you have a customer and a supplier, but both are the same entity, so if the address changes, it has to change in two different places. Either you should have a single entity and treat only the customer or supplier relationship as something separate (normalized or not), or, if you really do not want to normalize (a "normal form" that does not exist in the books on the subject), you should ensure that a change in the customer table changes the supplier table as well. Virtually no system I know does that right. Interestingly, it is easier to do this in the document model, but because almost everyone who adopts that model still thinks relationally, they do not get it right either.

Learn modeling and normalize

Data modeling is probably the most important skill a developer should have, and most developers, even experienced ones, are pretty bad at it (I have been at this for 36 years and am still learning). With that in mind, start preparing.

It requires an enormous capacity for interpreting text (often a text that does not even exist yet) and mathematical understanding, especially of sets: the material from 3rd grade (in my day) that almost nobody values and that many teachers do not even understand why they are teaching. If these foundations are missing, start over from them; nothing stands without them.

Examples of normalization

I have already given several reviews here about modeling and normalization. Reading those examples can help you gain experience; just do not take them as rules. There are reviews by other people too, but those are harder for me to find.

It amazes me that this question had only about 4,000 views by the time I wrote this answer, which says a lot: with so many problematic applications out there, people are not concerned with the right things.

  • "Alternatives to the Entity-Attribute-Value (anti-)pattern" is, I think, a complement to or a partial example of your answer.

  • "One of the reasons I criticize microservices is that normalization is destroyed in the name of an almost always unnecessary architecture." - I am very far from being a great expert on microservices, but as far as I know, only a poorly made or poorly planned microservice architecture would actually be destroying normalization probably because it is sacrificing canonicity. Or at least this would be a trade-off where it sacrifices itself normalization to gain performance, something analogous to a denormalization, and would only be valid if done consciously and well-thought.

  • Maybe, but then I have never seen one done right. There is a way to do it and maintain normalization, with each microservice taking care of one table, which would be one of the silliest things I have ever seen. When you build a microservice, it embeds everything it needs. Then another microservice embeds everything it needs in isolation, even though some of it already existed in the first one; it would only avoid that if it depended on the other microservice to provide that information in isolation and assembled the relationship itself, which I have seen some do, i.e., the microservices together act as "the database". I would love to know more about what you think about this, and maybe revise my view.

  • I am not sure, because I do not know enough about it. But the idea of microservices is to divide the domain into small pieces that can be managed, evolved and reused separately. If this is done wrong, whether by deficiency or by excess, it will go wrong. So I think what counts is to look at the purpose of each microservice from a top-down perspective, apply YAGNI to the underlying database, and replace what would be some of the foreign keys of the monolithic model with URLs or external codes, which is already the case with CPF or CEP, for example.

  • @Victorstafusa I do not know that much either, because I have never done it and never will. It is used for two reasons: scale (SO is one of the 50 largest sites and does not need it) and complying with Conway's law (</sarcasm>). And that is where the database comes in. Everyone says there is no microservice if you have a monolithic DB; it only exists if each microservice has its own DB. Then you "normalize" so much that it ends up being a universal malpractice: each DB has to have everything it needs to work, because it cannot consult the others, so there is no duplication within it, but there is across the whole.

  • Or each microservice is one table and answers only for it, with the results all assembled in the application: a total insanity. Not that the duplicated form does not happen. And that is the problem of microservices and NoSQL, where the developer throws onto the application all the responsibility for everything the database does not do. In fact, almost every microservice uses NoSQL, and that makes perfect sense; it is the microservice itself that does not. I love responsibility in the application, but up to a point: it is for defining queries, validating and doing macro processing.

  • You mentioned two examples of natural microservices. Login is usually another. The problem is doing an ERP with microservices, where almost all tables are related within 2 or 3 degrees of separation. There are some things that are easy to do because they are isolated, work as services and are small. I want to see someone talk about microservices in whole systems. What you are describing is not a microservice architecture; it is a microservice created because it is naturally one, which is different from turning what would be a monolith into microservices. Explain to me how to do it "right" in an entire ERP.

  • @Victorstafusa see https://www.eximiaco.tech/pt/2019/08/12/event-carried-state-transfer/ - all the descriptions I see preaching microservices say the same thing (5th paragraph, the one with bold text).

  • I am trying to write a question here on SOpt to clarify this. But first I will collect the pertinent material; I will see if I can complete it by the weekend (or give up). I am also a little afraid to do it, because questions at this level of depth are very difficult to answer with anything beyond what we already have.

  • I could not elaborate that question, but I found a very interesting and very recent link (from three days ago): https://dzone.com/articles/breaking-the-monolithic-database-in-your-microserv

  • @Victorstafusa Anything done right works out, and anything done wrong does not; the difference is that some things are easier to do right and others easier to do wrong. For me, until someone convinces me otherwise, microservices are easier to get wrong precisely because canonicality is lost. The only thing this article says, an obviousness to those who know microservices, is that you should use bounded contexts, which destroys the canonicality of information, my biggest criticism of DDD. At no point does it show how canonicality can be maintained. There is no way; that is the challenge.

  • There is one way to truly maintain canonicality with nanoservices: the service does just one operation, fetching the information from the database, and all processing is done in the application. I know there are people who do this, and it is one of the dumbest things I have ever seen in my life: http://wiki.c2.com/?ReinventingTheDatabaseInApplication


0

Normalization is important and should be carried out up to 3NF, but for performance reasons it may also be necessary to denormalize afterwards. Sometimes, after making the data model all pretty, we run the load test and the database craaaawls. It may be necessary to undo part of the normalization ;-) Common sense!
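As a hedged sketch of what such a targeted denormalization might look like (the names are illustrative; a materialized view or a trigger-maintained column are common variants), one can cache a frequently joined value and accept the duty of keeping it in sync:

    -- Redundant copy of the customer's name on the order row, to skip
    -- a JOIN on a hot read path. The application (or a trigger) must
    -- now keep it consistent; that is the price of denormalizing.
    ALTER TABLE pedido ADD COLUMN nome_cliente_cache VARCHAR(100);

    UPDATE pedido
    SET nome_cliente_cache = (SELECT c.nome
                              FROM cliente AS c
                              WHERE c.codigo = pedido.codigo_cliente);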

  • I agree with your point of view, even though many people downvoted your answer. A database has to be analyzed together with the business. There is no rule that cannot be "broken".
