What is normalization?
Database normalization is the process of transformations in the structure of a database that aims to eliminate redundancies and eliminate insertion, update and deletion anomalies.
When performing the normalization process, the data registered in the database will be better organized and most of the time will also occupy less physical space. However, the normalization process also always increases the number of tables and in many cases can be a difficult task to perform. In addition, standard databases beyond what is necessary may have poor performance and/or excessive complexity.
Anomalies
The main purpose of the standardisation process is to eliminate insertion, update and exclusion anomalies. The anomaly occurs when there is no way to register certain information without some other information being directly registered. For example, imagine you have a table funcionário
with the following data: codigo
, nome
, projeto
, where the column projeto
corresponds to the name of the project in which an employee was allocated. And then you have the following data:
code |
name |
project |
1 |
Peter |
Sales |
2 |
Maria |
Sales |
3 |
Carlos |
Customer registration |
And then came a new project: The issue of tax bills. How do you register this new project? The answer is that you can not register, because to do this you would have to have some employee in this project - IE, we have an insertion anomaly.
If in the previous example, the employee Carlos was disconnected from the company and removed from the table, the information about the client registration project is lost. This is an unwanted side effect - it is the exclusion anomaly. If, however, it were only moved to the new bill bill, we would also lose the information about the existence of the customer registration project - this is the anomaly of change.
The problem that causes these anomalies is that the project information is all within the employee table, which is not its place. If we have two related tables (1-to-N) - employees and projects - the anomalies disappear.
Anomalies also relate to the concept of referential integrity.
First normal form (1NF)
The first rule to eliminate anomalies is:
There shall be no multivariate columns.
What is a multivariate column? It is a column in which it is possible to store more than one value per record. For example, imagine that you have the columns in your client table codigo
, nome
and telefones
, thus filled in:
code |
name |
telephones |
1 |
Paul |
99567-4289, 3605-4900 |
2 |
Maria |
97223-0444 |
3 |
Alan |
- |
4 |
Juliana |
2853-0266, 96610-5480, 2410-9941 |
Note that the phone column is multi-valued. Inserting, updating and deleting phones in this scheme is a tricky thing to do. Searching for a specific phone number is also tricky. The solution to this is to split into two tables:
Client:
code |
name |
1 |
Paul |
2 |
Maria |
3 |
Alan |
4 |
Juliana |
Telephone:
code |
client code |
number |
1 |
1 |
99567-4289 |
2 |
1 |
3605-4900 |
3 |
2 |
97223-0444 |
4 |
4 |
2853-0266 |
5 |
4 |
96610-5480 |
6 |
4 |
2410-9941 |
By eliminating all multivariate columns, the database reaches a structural form called normal first form, or simply 1FN.
Candidate keys and superchaves
Before proceeding with the other normal forms, it is necessary to introduce the concepts of candidate keys and superchaves.
The primary key (Primary key, or just PK) is that set of columns that serves to identify the tuple in a unique way (can be only one column, or can be two or more). It is important that the database designer knows how to identify which columns are most appropriate to be chosen as part of the primary key. Often, a column with a sequential numeric code may serve to be the primary key, but there are several situations where this is not the case.
Sometimes there is more than one set of columns that could be primary key. Each of these sets is called candidate key. For example, in a table Pessoa
that has the fields CPF
, RG
, Estado
and Nome
, both the CPF
as to the RG
along with the Estado
are candidate keys. Thus, it is possible to reach the Nome
from the CPF
, but it is also possible to achieve Nome
from the RG
and of Estado
.
Any set of columns that has a subset, a candidate key is called a superchave.
Second normal form (2FN)
The second normal form is the one that says:
All columns must have functional dependency with the totality of each candidate key.
In most cases by "each candidate key", is understood by "with the primary key", unless there is more than one candidate key.
Also, for a table to be in the second normal form, it must be in the first normal form.
A column is in functional dependence with the primary key when it is determined in the field of application through the primary key. A column has no functional dependency with the primary key when it is defined independently of the primary key or when it is defined from some other column than the primary key.
A functional dependency can be said to be in the totality of the primary key when all fields of the primary key are necessary to establish the dependency relation. In case the primary key is composed, it is possible to have a partial dependence.
To give an example of 2FN, imagine that your company has sales representatives acting on customers and wants to represent the relationship of which sales representatives act on which customers. On this table (let’s call representação
), we have the columns nome_cli
for the customer name and nome_repr
for the representative name as primary keys and we also have the columns end_repr
to the address of the representative, end_cli
to the customer’s address and valor_contrato
. That is to say:
Depiction:
nome_cli (PK) |
name_repr (PK) |
end_repr |
end_cli |
contract value |
Louis |
Cleyton |
888 Street |
X Street 123 |
R$ 500,00 |
Paula |
Lara |
Rue B 999 |
Rue Y 345 |
R$ 900,00 |
Paula |
Cleyton |
888 Street |
Rue Y 345 |
R$ 650,00 |
Note that the columns end_repr
and end_cli
depend on the primary key. But they do not depend on all the primary key, each one depends only on part of it. The solution in this case is to have a table of customers (with the customer’s address), a table of commercial representatives (with his address as well) and to leave in the table of action the respective foreign keys with the value of the contract. That is to say:
Customers:
nome_cli (PK) |
end_cli |
Louis |
X Street 123 |
Paula |
Rue Y 345 |
Representatives:
name_repr (PK) |
end_repr |
Cleyton |
888 Street |
Lara |
Rue B 999 |
Depiction:
nome_cli (PK) |
name_repr (PK) |
contract value |
Louis |
Cleyton |
R$ 500,00 |
Paula |
Lara |
R$ 900,00 |
Paula |
Cleyton |
R$ 650,00 |
Third normal form (3FN)
The third normal form is the one that says:
All columns must have functional dependency with the totality of each candidate key and nothing more than these candidate keys.
Again, in most cases by "each candidate key", is understood by "with the primary key", unless there is more than one candidate key. If the only existing candidate key is the primary key, this would look like this:
All columns must have functional dependency with the whole primary key and nothing more than the primary key.
Moreover, for a table to be in the third normal form, it must be first in the second normal form (and also in the first). The part of depending on the totality of each candidate key is addressed in the second normal way, so the focus here is to depend on nothing more than these keys.
For example. Imagine the table of cars with the columns placa
(primary key), cor
, nome_proprietário
, endereço_proprietário
:
plaque (PK) |
color |
landlord |
proprietary address |
ABX-1234 |
blue |
Joseph |
Street X, 123 |
NNU-5566 |
green |
Marcos |
Example street, 5678 |
SGH-7210 |
black |
Maria |
Test Avenue, 7743 |
ERT-6902 |
red |
Joseph |
Street X, 123 |
BGH-5431 |
black |
Joseph |
Street X, 123 |
Note the owner’s address - it is a violation of the third normal way. Note that the owner’s address is set as a result of who owns it, not as a consequence of the car’s license plate. If José changes address and we update the address of only one of his cars, the database will be inconsistent (there is a change anomaly). If Maria buys another car and we add it with another address, it will also be inconsistent (insertion anomaly). If Marcos sells his car, your address will be forgotten (deletion anomaly). The solution again, is to separate into two tables:
Car:
plaque (PK) |
color |
proprietary |
ABX-1234 |
blue |
1 |
NNU-5566 |
green |
2 |
SGH-7210 |
black |
3 |
ERT-6902 |
red |
1 |
BGH-5431 |
black |
1 |
Owner:
code (PK) |
name |
address |
1 |
Joseph |
Street X, 123 |
2 |
Marcos |
Example street, 5678 |
3 |
Maria |
Test Avenue, 7743 |
Boyce-Codd normal form (BCNF)
There is a normal form that is a little stronger than the normal third form, but that is not required to reach the fourth (or even the sixth). This is the normal form of Boyce-Codd (BCNF), also sometimes called 3.5NF or normally Boyce-Codd-Heath.
It is very rare to find cases of tables that are in the third normal form, but not in the normal form of Boyce-Codd. In addition, there are (unusual) cases where this normal form is impossible to achieve (unlike the forms between the first and the sixth, which can always be achieved).
The difference appears when there is more than one candidate key and they have some intersection. The idea is that the fields that are candidate keys always determine the other fields, and never be determined by them. Therefore, in the normal form of Boyce-Codd it is not allowed to arrive at a candidate key based on some other candidate key through functional dependencies.
Taking the example borrowed from a reply from Soen, let’s say that each pizza can have several different toppings, each of a different type and that you have two types of pizzas with these toppings:
Pizza |
Penthouse |
Type of cover |
1 |
mozzarella |
cheese |
1 |
pepperoni |
meat |
1 |
olives |
vegetable |
2 |
mozzarella |
meat |
2 |
sausage |
cheese |
2 |
pepper |
vegetable |
There are two candidate keys there: Pizza and type of topping define the coverage used. Another candidate key is that from the pizza and the topping used, we can define the type of topping used.
Note that for whatever candidate key we look at, the other column depends on the whole candidate key and on nothing more than the candidate key, so the third normal form was reached.
Well, there’s something wrong there, because sausage is not a cheese and mozzarella cannot be meat and cheese at the same time - although we have reached the third normal form, we still have anomalies.
The solution is again split into two tables:
Pizza (PK) |
Coverage (PK) |
1 |
mozzarella |
1 |
pepperoni |
1 |
olives |
2 |
mozzarella |
2 |
sausage |
2 |
pepper |
Coverage (PK) |
Type of coverage |
mozzarella |
cheese |
pepperoni |
meat |
olives |
vegetable |
sausage |
meat |
pepper |
vegetable |
Fourth normal form (4FN)
It is rare to find cases of tables that are in the third normal form, but not in the fourth. The fourth form concerns anomalies existing in the relation between different columns of the primary key, and only applies in tables with primary keys composed of three columns or more.
Tables representing many-to-many ternary, quaternary or n-ary relationships are places where it is worth taking a look at possible violations of 4FN.
For a table to be in the fourth normal form, it must first also be in the third normal form. The normal form of Boyce-Codd is not required.
It is kind of difficult to explain, but imagine that we have the case of commercial representatives acting in clients, where several representatives can act in several clients and we have service contracts in which several representatives act in several contracts. And then we have the following table, where all columns are primary key:
representative |
contract |
client |
Geraldo |
1 |
clothing store |
Geraldo |
1 |
hospital |
Geraldo |
2 |
clothing store |
Geraldo |
2 |
hospital |
Marta |
1 |
toy store |
Marta |
1 |
hospital |
Marta |
3 |
toy store |
Marta |
3 |
hospital |
Louis |
2 |
clothing store |
Louis |
2 |
toy store |
Louis |
4 |
clothing store |
Louis |
4 |
toy store |
Note that we have many-to-many relationships between representatives and contracts and between representatives and customers. But we do not have between clients and contracts! Here is the violation of the normal way and the possibility of having some kind of anomaly. The ideal is to separate in two tables, one with the relationship between representatives and customers and the other with the representation between representatives and contracts:
representatives |
contracts |
Geraldo |
1 |
Geraldo |
2 |
Marta |
1 |
Marta |
3 |
Louis |
2 |
Louis |
4 |
representatives |
customers |
Geraldo |
clothing store |
Geraldo |
hospital |
Marta |
toy store |
Marta |
hospital |
Louis |
clothing store |
Note that in the original table, we have redundant information. For example, the information that Marta is in contract 1 appears twice. If we deleted the Marta-1-hospital tuple, we would have an exclusion anomaly, since Marta works in the hospital (contract 3) and Marta is in contract 1 (in the toy store). We also have an insertion anomaly in case we have any representative allocated for any contract where there is no customer yet.
Fifth normal form (5FN)
The fifth normal form is more restricted than the fourth, and also applies to tables with 3 or more columns in the primary key. Again, for the fifth normal form to be attained, it is necessary to attain the fourth normal form first.
Let’s imagine a case similar to the one I used in the fourth normal way, but this time in addition to the relationship between representatives and customers and between representatives and contracts, we also have the relationship between customers and contracts. So we have this table:
representative |
contract |
client |
Geraldo |
1 |
clothing store |
Geraldo |
1 |
hospital |
Geraldo |
2 |
clothing store |
Marta |
1 |
hospital |
Marta |
3 |
toy store |
Marta |
3 |
hospital |
Louis |
2 |
clothing store |
Louis |
4 |
toy store |
Note that this table is different from the one we use in the fourth normal form. There are some redundancies, for example she says twice that Geraldo is in the clothing store and that contract 1 is applied to the hospital. The technique used to put the fourth normal form does not apply, because for any two columns obtained from the key, there is a relationship (ie we have the contract-customer relationship too). Thus, this relationship is already in the fourth normal form, but there is still room for normalization. When we break it down into three different relationships, we come to the fifth normal form:
representatives |
contracts |
Geraldo |
1 |
Geraldo |
2 |
Marta |
1 |
Marta |
3 |
Louis |
2 |
Louis |
4 |
representatives |
customers |
Geraldo |
clothing store |
Geraldo |
hospital |
Marta |
toy store |
Marta |
hospital |
Louis |
clothing store |
Louis |
toy store |
contracts |
customers |
1 |
clothing store |
1 |
hospital |
2 |
clothing store |
3 |
toy store |
3 |
hospital |
4 |
toy store |
Sixth normal form (6FN)
For a table to reach sixth normal form, it must first be in fifth normal form. It dictates that:
Every table can only have one or no column that is not part of the primary key.
Finding cases where the fifth normal form was reached but the sixth not is easy: Any table with two or more columns that are not primary key is a violation of the sixth normal form.
However, finding cases where the sixth normal form makes sense of being applied and brings some real benefit with it is quite rare - in most cases applying it ends up being crazy, because the application of this normal form tends to produce an explosion in the number of tables without bringing any or almost no benefit with it. The sixth normal form was not defined to be something useful in practice, it has more the purpose of being a theoretical limit that defines what is the final point from which no other normalization would be possible.
To see what the process would be like here (and also to doubt any real benefit it brings), imagine the table pessoa
with the fields codigo
(primary key), nome
(NOT NULL
), sexo
(NOT NULL
) and cor_favorita
(NULLABLE
), and have these records:
code (PK) |
name |
sex |
favorite color |
1 |
Marcela |
F |
green |
2 |
Rodolfo |
M |
- |
3 |
James |
M |
yellow |
She’d be decomposed like this:
Name-person:
code (PK) |
name |
1 |
Marcela |
2 |
Rodolfo |
3 |
James |
Person-sex:
code (PK) |
sex |
1 |
F |
2 |
M |
3 |
M |
Person-color-favorite:
code (PK) |
favorite color |
1 |
green |
3 |
yellow |
Note that each column that was not part of the primary key ended up in a separate table. Also note that the table pessoa-cor-favorita
does not have a record for element 2 (Rodolfo). This form of normalization eliminates the need to have NULLABLE
in columns, and all resulting columns are NOT NULL
. The cases of columns that were null simply ended up being omitted from the resulting records.
Normal form of key domain (DKNF)
The normal form of key domain is that which dictates that:
All database integrity restrictions shall be imposed either by key restrictions or by domain value restrictions.
Key restrictions mean primary keys and foreign keys. Domain values means the set of valid values for each given column.
This normal form is quite strong, being stronger than the fifth normal form and the normal form of Boyce-Codd together. However, it is no stronger than the sixth, since it allows the existence of several columns that are not part of the primary key in the same table and nor is the sixth stronger than it, since there is no mechanism in 6FN that ensures that all integrity restrictions are modelled as required by DKNF.
This normal form is nirvana, utopia, the ideal and perfect state of normalization. However, in practice, it’s almost impossible to achieve it because any database with business rules with some complexity will probably have some kind of consistency rule that doesn’t have to be modeled just as key or domain value constraints. However, even when unattainable, an effort to get close to it is valid for eliminating various possibilities of anomalies.
Other normal forms
There is a published algorithm (Barnstein’s algorithm) to normalize tables and take them to the third normal form. However, later research found that this algorithm is a little more rigid than the third normal form requires, but still not satisfying the normal Boyce-Codd form or the fourth normal form. For this reason, the normal form achieved by this algorithm was called elemental key normal form (EKNF).
Another existing normal form, stronger than the fourth normal form and the normal form of Boyce-Codd together, but weaker than the fifth normal form and the normal form of Boyce-Codd together is the normal form of essential tuple (ETNF). The authors who defined this normal form put it as an alternative to the standard fifth normal form, which although weaker, would be just as effective. The ETNF is reached when the BCNF is reached and at least one candidate key has only one field. The authors also refer to other works that define the normal form of superchaves (superkey normal form - SKNF), to redundancy-free normal form (Redundancy-free normal form - RFNF) and full key normal shape (key-complete normal form - KCNF). 5NF+BCNF is stronger than SKNF which is stronger than RFNF which is stronger than ETNF which is stronger than 4NF+BCNF. It is also shown that KCNF is equal to RFNF, although they have been defined by different people in different ways.
Ah, and of course, we also have the non-standard form (Unnormalized form), which is that form which does not fit even the first normal form.
How far to normalize?
In general, many are satisfied with reaching the third normal form and do not care much about the other normal forms higher than this because:
They apply to rare cases, because when the third normal form is reached, almost always the Boyce-Codd, the fourth and fifth were also by luck or accident. This is because it is difficult to have a case in 3NF that is not in BCNF and also because tables with three or more columns in the primary key and that have dependency relations between these columns are something very rare.
Since the third normal form is affected (and probably by luck or accident, that of Boyce-Codd, the fourth and the fifth too), there is very little scope for anomalies. Seeking to achieve the sixth normal form is in general crazy and does not bring any practical benefit. Seeking to achieve the normal form of key domain is almost always impossible, although this search may still reveal some possibility of anomaly that can be eliminated.
Sometimes, to improve the performance of the database or simplify its structure, the recommendation ends up being to denormalize some things. Many systems with a focus on Dataware house or business intelligence are standardised only until the second normal form, often being designed from structures that were already in the third normal form and which have undergone a denormalization.
In fact, the sixth normal form is more useful when thinking of denormalization rather than normalization. The idea would be to look for some of the tables that have 1-to-1 relationships and then unify them, thus effecting a denormalization. After all, when data pertaining to a record of a particular application domain concept is spread across several tables with 1-to-1 relationships, it is an indication that maybe they should be in the same table.
Where can I apply the concept of standardisation?
The concept of normalization and normal forms apply only to relational databases. Other types of databases that are not relational (for example, an application that saves data in files organized in folders), may also have analogous concepts of normalization to eliminate redundancies, improve the structure and reducethe possibility of anomalies occurring. However, in this field, the concepts of normalization are not as well defined as in the case of relational databases and each case will have its particularities.
The case of Nosql in particular is very interesting, because its idea is to model data that do not need to have a strong consistency (what is called eventual consistency), and often its data comes from places where there is not a very strong and well defined structure for the data. For this reason, it does not make so much sense to talk about normalization in Nosql databases as it does with relational SQL, because the idea of Nosql is to accept, tolerate and know how to deal with any anomalies and be able to manage even if they arise. Nosql databases sacrifice consistency in exchange for scalability, and for this reason, anomalies have to be tolerated. Despite this, there are still some concepts regarding normalization in Nosql, but they are not as well defined or matured. See a little bit about this here.
Data standardisation and database standardisation
To close the answer to this question, database normalization is the standardisation of the database structure, while data normalization corresponds to the normalisation of data already existing in the tables.
However, especially in the case of database restructurings that are already in production, the two concepts go so far together and mixed that it doesn’t even make sense to talk about one of them without also talking about the other. Hence, in many situations that happen in practice, they end up being placed as if they were synonyms.
Related: http://answall.com/q/22069/101. In a certain way it is to do a DRY in the database: http://answall.com/q/120931/101
– Maniero
This is a community where the questions work in a way that helps not only my person (who wanted to answer those questions) but other people as well. Therefore, I suggest to those who voted 'no' to show something defective in the question, so that I can correct and thus help those who might be interested in this question in the future.
– Wallace Maxters