Create a VIEW or a new TABLE in Postgres?

Asked

Viewed 570 times

1

Olás! I have a situation... I have a table that will have about 3*10 12 lines (3 trillion), but with only 3 attributes. In this table will have the Ids of 2 individuals and the similarity between them (is a number between 0 and 1 that I multiplied by 100 and put as a smallint to decrease the space). It turns out that I need to perform, for a certain individual who wants to do the research, the summarization of these columns and returning how many individuals have up to 10% similarity, 20%, 30%... These values are fixed (from 10 to 10), until arriving in identical individuals. However, as you may know, the consultation will be very slow, so I thought: - Create a new table to store summarized values - Create a view to store these values. Since the individuals are about 1.7 million, the search would not be as time consuming (if indexed, returns very fast). So, what can I do? I stress that my population will be practically fixed (after the BD is fully populated, it is expected that almost no increase will be made).

  • this table that has 3 rails undergoes some kind of update or Insert? is a statistic value?

  • Hudson: "Saliento que essa minha população será praticamente fixa (após o BD ser totalmente populado, espera-se que seja feito quase nenhum acréscimo)". It would only work with View if the query to be executed is always the same, without the need to apply conditions, otherwise it gets very bad. I think there could be a new table, that with a Trigger make the similarity ratio for each subject

  • Have you ever thought of creating indexes based on expressions ? It seems to be the case that Postgreesql seems to support https://www.postgresql.org/docs/9.1/static/indexes-expressional.html

  • @Hudsonph, this table will not be updated, however, will have Inserts whenever a new individual is inserted in the list of individuals (after the table created and fully populated, it may be necessary to insert a hit or two, but rarely).

  • @Motta, I will study this possibility. For now I have created 2 indexes: one for each individual column (to find them in the middle of the huge list).

  • @Rovannlinhalis. So, from what I understand, for you, a view wouldn’t be interesting because you have to keep making new queries every time I run (like, always wanting to change which individual I want to return), right? Now, I don’t understand how Trigger would be useful.

  • If you can show the example of the calculation of its "summary", in the answer of the lacobus I commented on the Trigger

Show 2 more comments

2 answers

1

In your case an index could perfectly solve your problem without the need to create a VIEW.

Assuming your data structure is something like:

-- INDIVIDUO
CREATE TABLE tb_individuo
(
    id BIGINT NOT NULL,
    nome TEXT NOT NULL,
    PRIMARY KEY( id )
);

-- SEMELHANCA
CREATE TABLE tb_semelhanca
(
    id_a BIGINT NOT NULL,
    id_b BIGINT NOT NULL,
    valor SMALLINT NOT NULL,
    PRIMARY KEY( id_a, id_b ),
    FOREIGN KEY ( id_a ) REFERENCES tb_individuo ( id ),
    FOREIGN KEY ( id_b ) REFERENCES tb_individuo ( id )
);

Containing the following sample data:

-- INSERE 5 INDIVIDUOS
INSERT INTO tb_individuo ( id, nome ) VALUES ( 1, 'JOSE' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 2, 'JOAO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 3, 'ANTONIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 4, 'MARIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 5, 'PEDRO' );

-- INSERE SEMELHANCAS ENTRE OS INDIVIDUOS 
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 2, 5 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 3, 13 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 4, 27 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 5, 98 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 3, 54 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 4, 63 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 5, 77 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 4, 85 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 5, 42 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 4, 5, 33 );

I suggest creating an index in the field valor table tb_semelhanca:

-- CRIACAO DE INDICE NO GRAU DE SEMELHANCA
CREATE INDEX ON tb_semelhanca ( valor );

What would enable queries using a BETWEEN to recover the desired intevalo with enough agility:

-- CONSULTA USANDO O INDICE
SELECT
    a.id,
    a.nome,
    b.id,
    b.nome,
    s.valor
FROM
    tb_semelhanca AS s
JOIN
    tb_individuo AS a ON ( a.id = s.id_a )
JOIN
    tb_individuo AS b ON ( b.id = s.id_b )
WHERE
    s.valor BETWEEN 20 AND 50;

Exit:

| id |    nome | id |  nome | valor |
|----|---------|----|-------|-------|
|  1 |    JOSE |  4 | MARIO |    27 |
|  4 |   MARIO |  5 | PEDRO |    33 |
|  3 | ANTONIO |  5 | PEDRO |    42 |

Sqlfiddle: http://sqlfiddle.com/#! 17/da827/1

EDIT

If the need to create a "summary" is something mandatory, I still do not see the need to create a VIEW or auxiliary table.

As the cardinality of sumário with the semelhanca would be of 1-para-1, the simple creation of a tb_semelhanca could solve the problem, let’s see:

-- INDIVIDUO
CREATE TABLE tb_individuo
(
    id BIGINT NOT NULL,
    nome TEXT NOT NULL,
    PRIMARY KEY( id )
);

-- SEMELHANCA
CREATE TABLE tb_semelhanca
(
    id_a BIGINT NOT NULL,
    id_b BIGINT NOT NULL,
    valor SMALLINT NOT NULL,
    sumario SMALLINT NOT NULL,
    PRIMARY KEY( id_a, id_b ),
    FOREIGN KEY ( id_a ) REFERENCES tb_individuo ( id ),
    FOREIGN KEY ( id_b ) REFERENCES tb_individuo ( id )
);

The index would be created in the field sumario of tb_semelhanca:

-- CRIACAO DE INDICE NO sumario
CREATE INDEX ON tb_semelhanca ( sumario );

One TRIGGER FUNCTION would be responsible for calculating the summary:

CREATE OR REPLACE FUNCTION calcular_sumario() RETURNS TRIGGER AS
$BODY$
BEGIN 
    NEW.sumario = (NEW.valor / 10)::int * 10; 
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

One TRIGGER would be created to trigger the function calcular_sumario() every time a modification of the tb_semelhança:

-- CRIANDO TRIGGER
CREATE TRIGGER trigger_sumario BEFORE INSERT OR UPDATE ON tb_semelhanca FOR EACH ROW EXECUTE PROCEDURE  calcular_sumario();

Populating tables:

-- INSERE 10 INDIVIDUOS
INSERT INTO tb_individuo ( id, nome ) VALUES ( 1, 'JOSE' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 2, 'JOAO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 3, 'ANTONIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 4, 'MARIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 5, 'PEDRO' );

-- INSERE SEMELHANCAS ENTRE OS INDIVIDUOS 
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 2, 5 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 3, 13 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 4, 27 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 5, 98 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 3, 54 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 4, 63 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 5, 77 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 4, 85 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 5, 42 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 4, 5, 33 );

Checking out tb_semelhanca:

| id_a | id_b | valor | sumario |
|------|------|-------|---------|
|    1 |    2 |     5 |       0 |
|    1 |    3 |    13 |      10 |
|    1 |    4 |    27 |      20 |
|    1 |    5 |    98 |      90 |
|    2 |    3 |    54 |      50 |
|    2 |    4 |    63 |      60 |
|    2 |    5 |    77 |      70 |
|    3 |    4 |    85 |      80 |
|    3 |    5 |    42 |      40 |
|    4 |    5 |    33 |      30 |

Consultation through the summary:

-- CONSULTA USANDO O SUMARIO
SELECT
    a.id,
    a.nome,
    b.id,
    b.nome,
    s.valor
FROM
    tb_semelhanca AS s
JOIN
    tb_individuo AS a ON ( a.id = s.id_a )
JOIN
    tb_individuo AS b ON ( b.id = s.id_b )
WHERE
    s.sumario = 20;

Exit:

| id | nome | id |  nome | valor |
|----|------|----|-------|-------|
|  1 | JOSE |  4 | MARIO |    27 |

Sqlfiddle: http://sqlfiddle.com/#! 17/7e62f/5

  • I think the question would be to calculate the 'summarizing' of the similarities in a view, or to store them, in a third table. I suggested that you make the similarity table and a Rigger in the individuals table, so that when you insert or alter an individual, their similarity is recalculated. Wandreveloso, do you understand the proposal ? @Lacobus always with great answers about postgres =]

  • 1

    @Rovannlinhalis The following edition.

  • @Lacobus and Rovannlinhalis, I really liked the option of adding a new indexed field to speed up the search. I believe I will implement this way with index creation in the first column (because I always want to know the summary for a certain individual) and in the summary column.

0

I had almost the same problem, but my bank is much bigger than yours and suffers updates and Insert every minute, the tables that passed 1gb I used Partition based on a column related to group.

The best solution after your answer in the comments, use Table Partition

because?
1 - will divide your table into subtabels and will not interfere in your select(will continue the same but very fast).

2 - since Voce does not make use of update your Rigger will not have problem with "on conflit", Trigger will only be used in the parent table to redirect the Insert.

3- Table Partition does not only divide table by date!!! Window can use some codicao but elaborate, ie "similarity between them".

How will your db, example:

TabelaMaster (Tabela original)
TabelaMaster-10 (Tabela gerada apartir do INHERITS )
TabelaMaster-20
...

selects can be made direct on the partition or in the table "parent" the result is almost the same, however use the parent avoid extra code.

  • Hello @Hudsonph, I don’t think Table Partition would be interesting since, in my case, I have a huge list of individuals with well concentrated similarities between 40 and 80%, which would divide the table into not so interesting parts (I am still thinking of using the technique mentioned in the previous comment in conjunction with Partition). Thank you!

  • You can even test, but the refresh view in your case will have a very high cost after all are 3 tracks that will be read and replaced every day in the view

Browser other questions tagged

You are not signed in. Login or sign up in order to post.