Are database redundancies always undesirable?

I am developing a system that keeps a record of all financial transactions occurring in a bank account (single-entry bookkeeping). There is a redundancy in my data model: a table saldo_conta keeps a list of dates and balances for a particular account, and another table movimentacao_financeira maintains a list of entries and exits for that account. These tables are redundant, because the information in one could be computed from the data in the other (in fact, there is a small difference, as explained below).

However, this redundancy has proved useful in the past: due to a bug in the user interface, several entries were logged with the wrong value. The bug did not affect the other table, so I ended up with two tables with historical discrepancies. Once I determined which of them was correct (and after fixing the bug, of course), it was a simple matter to run a script to recalculate the incorrect table's values based on the other.

Despite this, I am not 100% convinced that maintaining this redundancy is the most correct way to do the modeling. I want my system to be robust, and I don't know how I would have recovered from this error if it weren't for the redundancy (the logs don't keep detailed information about the entries); I think I would have had to manually check each of them against the bank statements attached [in image format]. Even so, I have this "preconceived notion" that it is bad practice, but I lack objective arguments to guide my decision.

Is it always bad to maintain redundancies in a relational data model? And if the answer is "yes", is there an alternative way to guard against bugs that might affect the most important data? (i.e., to simplify recovery in case of errors)


Update: a little more background on my specific case - the table movimentacao_financeira contains a list of transactions that were initiated but not necessarily confirmed. An example would be issuing a check: the money has already been "spent" (in the sense that it is provisioned and cannot be used for anything else), but as long as the check is not cashed it does not yet appear in the bank's records. The table saldo_conta, on the other hand, records the balances already reconciled, that is, what appears on the statement. There is an N x 1 relationship between these tables, since several individual transactions can be confirmed by a single bank reconciliation.

The redundancy lies in storing the balance in this second table - since it could be computed as the sum of all transactions confirmed up to that point. However, there is a semantic difference: the balance recorded in this table must be exactly the balance shown on the bank statement - if the computed value (i.e., the sum of the previous transactions) differs from what is on the statement, something is wrong and attention must be drawn to the fact (this verification is made during reconciliation, but if new entries are later recorded with past dates, discrepancies can be introduced - with no means of detecting them if the redundant value is not stored).
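
For reference, a rough sketch of the two tables (the column names are illustrative, not my exact schema):

CREATE TABLE saldo_conta (
    id       INT PRIMARY KEY,
    conta_id INT NOT NULL,
    data     DATE NOT NULL,           -- date of the reconciled bank statement
    saldo    DECIMAL(15,2) NOT NULL   -- redundant: could be derived from the confirmed transactions
);

CREATE TABLE movimentacao_financeira (
    id             INT PRIMARY KEY,
    conta_id       INT NOT NULL,
    data           DATE NOT NULL,
    valor          DECIMAL(15,2) NOT NULL,              -- positive = entry, negative = exit
    saldo_conta_id INT NULL REFERENCES saldo_conta (id) -- N x 1: filled in when confirmed by a reconciliation
);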

3 answers

Data redundancy is permissible in two cases:

Performance

It is common for a database designed for heavy loads to have some redundant data, so that JOINs can be avoided in SELECTs. See an example:

Imagine we're building a clone of Stack Overflow. An answers table could also contain data about the user who answered, such as their nome and username; then, when building the user interface, we would not need an additional SELECT on the users table to show the answers - only the base table, respostas. Of course this is an illustrative example and not very viable in the real world, but the idea is basically this.
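
Just to make the idea concrete, a sketch of such a denormalized respostas table could look like this (table and column names are hypothetical):

CREATE TABLE respostas (
    id               INT PRIMARY KEY,
    pergunta_id      INT NOT NULL,
    corpo            TEXT NOT NULL,
    usuario_id       INT NOT NULL REFERENCES usuarios (id),
    -- redundant copies of usuarios.nome and usuarios.username,
    -- so listing the answers needs no JOIN:
    usuario_nome     VARCHAR(100),
    usuario_username VARCHAR(50)
);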

Scalability

In high-volume data scenarios, it is common to break standard database normalization in order to keep the system viable. The example above basically fits this context, but let's look at another.

Imagine we need to show a history of bank transactions, stored in a table called historico, while the account data is stored in a table called contas. Following the normalization rules, we would design something like:

Table contas

id - inteiro / auto incremento
numero_conta - inteiro
dv - inteiro
nome_correntista - varchar
sobrenome_correntista - varchar
cpf - varchar

Table historico

id - inteiro
debito_credito - booleano
valor - double
conta_id - inteiro

To display the transactions, showing the account number and check digit, we would run the following query:

SELECT historico.id, 
       historico.debito_credito, 
       historico.valor, 
       contas.numero_conta,
       contas.dv
FROM historico JOIN contas on contas.id = historico.conta_id

Note that we use two tables, and in a distributed system this would take some extra time to load. If we add the redundant columns conta_numero and conta_dv, we could do this SELECT on just one table:

Structure of historico

id - inteiro
debito_credito - booleano
valor - double
conta_id - inteiro - chave estrangeira
conta_numero - inteiro
conta_dv - inteiro 

Then we could query as follows:

SELECT 
    id,
    debito_credito,
    valor,
    conta_id,
    conta_numero,
    conta_dv
FROM
    historico

And then we’d have a significant performance gain.

Data security

No, data redundancy in tables is not a way to keep your data safe; the case described was an exception. In a production database you should have techniques and tools that produce consistent, versioned backups that are easy to restore in case of failure.


Note 1: Only immutable data should be made redundant, because if the data changes in the main table (contas, in our case), the data in the historico table would become inconsistent.

Data such as the account number and check digit do not change, whereas the last name can change, say, due to a marriage.

It is not a hard rule, but the effort to synchronize mutable data later tends to be greater than the performance gain.
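
To give an idea of that effort, a one-off re-sync of the example above (SQL Server syntax, hypothetical script) would be something like:

-- re-copy the redundant columns wherever they drifted from the master table
UPDATE h
SET h.conta_numero = c.numero_conta,
    h.conta_dv     = c.dv
FROM historico h
JOIN contas c ON c.id = h.conta_id
WHERE h.conta_numero <> c.numero_conta
   OR h.conta_dv     <> c.dv;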


Note 2: The examples given are for didactic purposes; do not design your financial-transactions database exactly as described in this answer.

  • +1 I liked the observation about safety.

  • That's a very good answer, but I think you misunderstood me about "data security". The problem is not "restoring the database to a previous state" - for that, the solution is indeed periodic backups (routinely verified, of course). What happened was that the data entered into one of the tables was incorrect, and this was only detected days later (because it took time for the discrepancy between balances to become apparent). In this scenario, the question remains - how to recover from this error, other than manually checking each transaction?

  • @mgibsonbr the issue is that the data does not follow a master table; the tables apparently sit at the same hierarchical level. Is there a master table that could serve as a reference?

  • @Ernandes No, the original purpose was for one to store initiated transactions (e.g., a check was written) and the other to store confirmed transactions (e.g., the check was cashed). In a way, the second can be considered the "master" - because after bank reconciliation all (N) entries that appeared on the statement are associated with the (1) new account balance on the statement date. Now that I've read your question, I realize that in fact it is not the whole table that is redundant, just a single column (saldo). I'll edit the question to make it clearer.

  • exactly @mgibsonbr, as I mentioned in the answer, data that undergoes mutation should preferably not be made redundant, to avoid consistency problems; however this is not a general rule, there are exceptions


TL;DR

Having participated in the development of a financial system used by several institutions - among them the bank of a car manufacturer famous worldwide - my team decided to eliminate the balance table in favor of high-performance queries using the SUM aggregate function to always calculate the balance in real time.

Background

At the time, we were "presented" with a legacy system that we were to migrate to the Java platform. The system generated daily movements and stored the previous day's balances (D-1) to try to improve performance.

But like any legacy system, it was full of bugs that the original team just couldn't fix. One of the worst was the generation, on seemingly random days, of incorrect balances for some accounts. Since the system relied on the previous balance to carry out future calculations, the error persisted and accumulated over the following days. Fixing the problem was always complicated, as it was necessary to trace the entire history to find where the initial cause was.

In addition, the balance table contained records only for days on which there was some movement, so the join always added complexity to the queries, because it was necessary to retrieve the balance whose date was the greatest one prior to the current date. The final balance was calculated by adding that last stored balance to the current day's movements in the system.

See an example in SQL Server:

select
    isnull((
        select sum(m.valor_movimento * CASE WHEN m.debcred = 'CREDITO' THEN 1 ELSE -1 END)
        from movimentacao m
        join composicao c 
            on c.tipo_movimento = m.tipo_movimento
            and c.cod_saldo = ?
        where m.data_contabil = ?
          and m.status = 1 -- active, i.e., not reversed
    ), 0) 
    +
    isnull((
        select valor_saldo
        from saldo
        where cod_saldo = ?
          and data_saldo = (
              select max(data_saldo)
              from saldo s2
              where s2.data_saldo < ?
                and s2.cod_saldo = saldo.cod_saldo
          )
    ), 0)

One of the challenges we faced was that new customers were acquiring this system and the financial operations currently managed by competitors would be migrated to it. The initial estimate for the largest client was the migration of 20 thousand operations, with 48 to 120 installments each, and each installment generated a daily accrual movement, just to mention one of the cases.

The solution we have adopted

We ran several tests on SQL Server (2005+) and Oracle (9i+) and came to the conclusion that, given the necessary indexes, running a SUM on the transactions table and simply ignoring the balance table did not hurt performance. On the contrary, it facilitated the creation of more robust queries and of INSERT INTO ... SELECT commands that allowed us to operate on thousands of operations at once.
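
Just to sketch the kind of index this assumes (the names here are hypothetical, not the actual system's):

-- covering index: the balance SUM can be answered from the index alone
CREATE INDEX ix_movimentacao_saldo
    ON movimentacao (tipo_movimento, data_contabil, status)
    INCLUDE (valor_movimento);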

Just to give you an idea, there were still several batch routines that relied on "caches" of information of various kinds (interest, business days, etc.), because in theory this would be faster than computing on the fly. I try not to criticize the previous team too much; after all, that is what I was taught too. However, when we executed the batch migration, we found that migrating the client's 20,000 operations would take approximately two weeks on a top-of-the-line server. After performing optimizations with robust queries and suitable indexes instead of relying on "cache tables", the time dropped to approximately 20 minutes. Obviously this required several days of work rewriting dozens of procedures and functions, but it was really worth it.

In addition to the indexes, we created a numeric column whose value would be 1 for credit and -1 for debit. Actually, it was not exactly in the transactions table, because the system had a separate balance composition table that allowed us to define different types of accounting balances, according to the types of movements, without changing the system.
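
To make the query below easier to follow, the composicao (balance composition) table can be imagined more or less like this (a sketch, not the real structure):

CREATE TABLE composicao (
    cod_saldo      INT NOT NULL,      -- which accounting balance this row feeds
    tipo_movimento INT NOT NULL,      -- which movement type participates in that balance
    debcred_int    SMALLINT NOT NULL, -- 1 for credit, -1 for debit
    PRIMARY KEY (cod_saldo, tipo_movimento)
);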

The more robust way we ended up using to calculate the balance was more or less like this:

select isnull(sum(m.valor_movimento * c.debcred_int), 0)
from movimentacao m
join composicao c 
    on c.tipo_movimento = m.tipo_movimento
    and c.cod_saldo = ?
where m.data_contabil <= ?
and m.status = 1 -- active, i.e., not reversed
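
As a variation, just to sketch the idea (this assumes SQL Server 2012+, newer than what we used, for the windowed SUM), daily running balances for a report could be produced in a single pass:

select m.data_contabil,
       sum(sum(m.valor_movimento * c.debcred_int))
           over (order by m.data_contabil
                 rows unbounded preceding) as saldo_acumulado
from movimentacao m
join composicao c
    on c.tipo_movimento = m.tipo_movimento
    and c.cod_saldo = ?
where m.status = 1 -- active, i.e., not reversed
group by m.data_contabil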

I would like to post more examples, but I do not have those queries at hand and the answer is already getting long.

Conclusion

Within our context, calculating balances by summation was more efficient, increased the flexibility to perform mass operations, and simplified development.

One of the main drawbacks of this approach would be in systems whose major focus is preserving historical data. Accountants generally like tables where they can see the "raw" data, and they do not like the possibility of a transaction affecting data "in the past". In fact, they prefer that the system keep the incorrect values in the past and only correct the present. The issues we did have in this regard were in reports, and in those cases we created historical tables, but without affecting the core of the system.

Moreover, for much larger data volumes - imagine the transaction volume of a large institution like Itaú, Bradesco, or Banco do Brasil - this solution would be unviable.

But the question here is whether or not it is worth adding such complexity to systems that do not have such a scale. One of the arguments is that we should look to the future, with which I agree in part. But if a large financial institution were to acquire such a system, wouldn't that require a series of changes anyway? Is it worth spending effort optimizing and building features that maybe one day, who knows, will be used?

  • Note that my answer focuses on the specific case cited in the question. Of course breaking normalization is important in some cases, as well explained by @hernantes in the other answer.

  • I agree - the existence of two types of records has brought similar problems to my project as well. Good to know that it is possible to calculate the balance through a SUM with good performance; this was a concern of mine too (apart from the issue of error recovery, which unfortunately was not addressed in any of the answers...).


Keeping your database with no repeated data, etc., is called normalization, and there are "5 normal forms".

The reverse process, which you describe, is denormalization: trying to optimize a database with respect to its performance, as a rule.

On denormalization, Oracle comments (source in English):

  • Consider denormalization as a last resort to improve system performance

  • To improve performance, denormalization should be done during database design

  • Denormalization always implies an increase in code

On the latter point, keep in mind that it will always be necessary to maintain integrity in the application code if it is not maintained in the database or by triggers.
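
For example (reusing the contas/historico tables from another answer, SQL Server syntax), keeping a redundant column in sync by trigger would look something like:

-- whenever the master record changes, propagate the new values to the denormalized copies
CREATE TRIGGER trg_contas_sync ON contas
AFTER UPDATE
AS
BEGIN
    UPDATE h
    SET h.conta_numero = i.numero_conta,
        h.conta_dv     = i.dv
    FROM historico h
    JOIN inserted i ON i.id = h.conta_id;
END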

The most common ways of denormalizing data are (per the same source above):

  • store derived values: keep the final balance, keep daily balances, keep annual balances, etc.

  • keep details in the master table: when there is a master-detail table scheme and only one value is stored in the detail table, you could store the value directly in the master table, as long as you take care that updates are always applied to all records

  • coded values, or hard-coded values: instead of storing some values in the database, you put them in the application. Yes, this is denormalizing :)

  • keys to "grandchild" tables: if your database has many levels of master-detail - details of details - it may be worthwhile to keep a short-circuit key between the master and the detail table that is queried most often, avoiding a JOIN in the query

  • derived dates: when you want to do a date search between a start and an end, and the end date is calculated from the start date, it may be faster to store the pre-calculated end date and search using a BETWEEN (see the sketch after this list)
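
A sketch of the last item (table and column names are made up for illustration):

-- instead of computing data_fim = data_inicio + prazo_dias at query time,
-- store it once at insertion and index it:
CREATE TABLE contratos (
    id          INT PRIMARY KEY,
    data_inicio DATE NOT NULL,
    prazo_dias  INT  NOT NULL,
    data_fim    DATE NOT NULL   -- derived: data_inicio + prazo_dias
);

CREATE INDEX ix_contratos_data_fim ON contratos (data_fim);

-- a range search then becomes a simple BETWEEN:
SELECT id
FROM contratos
WHERE data_fim BETWEEN '2015-01-01' AND '2015-12-31';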

Conclusion: database normalization exists to ensure that data is consistent, related, etc. This comes at a performance cost. In certain cases it may be worthwhile to shift this cost to insertion time (i.e., your INSERT becomes more complex, or the application needs more code at insertion time), making later data retrieval faster. And the key to measuring this is how many times a piece of data is inserted versus how many times that same data is retrieved later.

  • Thanks for the information, I'll study it. As for "complexity at insertion", fortunately in my particular case this is not a problem - the "entries and exits" table is populated as transactions are made (e.g., I wrote a check), while the "balances" table is populated at the time of bank reconciliation (e.g., the check was cashed and appears on the statement). However, I am aware that even this could be done in a normalized way (as other systems similar to mine do).
