Whereas:
- the order of the elements (numbers) in the set does not matter;
- there are no repeated elements (that is, a same number cannot occur
more than once in the same set); and
- the values of the numbers vary from 0 to 99,
I suggest you store the sets in bit map form. For each set a 100-position vector is created, where each position represents a number; if the number is present in the set, the position is 1, but if the number is not part of the set, the position is 0.
Assuming that each set can have elements ranging from 1 to 10, to represent the set {4, 1, 9} we would have 1001000010, where the first bit represents the number 1, the second bit represents the number 2 and so on, up to 10.
Considering that repeated sets may exist, the bit map is in a table. That is, each existing element combination occupies a single bit map. In another table, a relation indicating which bit map each set uses.
The initial idea was to implement the bit map with a column declared as bit(100). That is, a bit string. But unfortunately this is not allowed in T-SQL. To circumvent this restriction, the first version then uses a byte string: char(100).
CREATE TABLE Tab_Bloco (
ID int identity,
Bloco char(100) not null default replicate('0', 100)
);
The other table contains, for each set, which bit map it uses.
CREATE TABLE Conjunto_Bloco (
ID_Conjunto int not null,
ID_Bloco int not null
references Tab_Bloco(ID)
);
For example, sets 1:{4, 1, 9}, 2:{1, 2, 3, 5, 7} and 3:{1, 4, 9} would be represented like this:
Tab_Bloco
1 1001000010
2 1110101000
Conjunto_Bloco
1 1
2 2
3 1
Note that sets 1 and 3 share the same bit map.
In principle, if it were possible to use the bit string, each set would only occupy 21 bytes (4 of Id_set, 4 of Tab_block ID, 13 of Block in Tab_block). It is obvious that this account does not include indexes and other controls. But, for now, the proposed solution takes 108 bytes (4 of Id_set, 4 of Tab_block ID, 100 of Block in Tab_block).
There is a way to implement the bit map in 4 variables of type int, since each one occupies 32 bits. Or maybe Binary(13). But it is necessary to use bitwise operations.
As the initial goal is performance, the new storage structure will ensure speed to find repeated sets as well as "such a set exists?".
To validate the new structure it is necessary to know which are the main manipulations with the data. At first, it will not be necessary to store the sets in the old form, but only as bit maps. That is, each new set to be added is stored directly as a bit map. From each vector it is possible to return the numbers that make up each element. But if it is necessary to keep the sets in the two representations (numbers and bit maps), then it is recommended to use procedure Trigger to ensure consistency between the two tables.
To check for repeated sets, it is a very simple process.
-- código #2
with Repetidos as (
SELECT ID_Bloco, Count(*) as Qtd
from Conjunto_Bloco
group by ID_Bloco
having Count(*) > 1
)
SELECT T1.ID_Conjunto, T1.ID_Bloco
from Conjunto_Bloco as T1
inner join Repetidos as T2 on T1.ID_Bloco = T2.ID_Bloco
order by ID_Bloco;
Code #1 below converts the current table to the two new tables. We opted for the use of cursor, for reading the data, because it seemed to me, in this case, more efficient than the use of code set based.
-- código #1
-- cria as duas novas estruturas de dados
CREATE TABLE Tab_Bloco (
ID int identity,
Bloco char(100) not null default replicate('0', 100)
);
CREATE clustered index I1_Tab_Bloco on Tab_Bloco (Bloco);
CREATE unique nonclustered index I2_Tab_Bloco on Tab_Bloco (ID);
CREATE TABLE Conjunto_Bloco (
ID_Conjunto int not null,
ID_Bloco int not null
references Tab_Bloco (ID)
);
CREATE clustered index I1_Conjunto_Bloco on Conjunto_Bloco (ID_Conjunto, ID_Bloco);
CREATE unique nonclustered index I2_Conjunto_Bloco on Conjunto_Bloco (ID_Bloco, ID_Conjunto);
go
--
DECLARE Lê_Número CURSOR
forward_only
for SELECT Conjunto, Numero
from Conjunto order by Conjunto;
declare @ID_Conjunto int, @Número int;
declare @ID_atual int;
declare @ID_Bloco int, @Bloco char(100);
OPEN Lê_Número;
-- obtém primeiro número do primeiro conjunto
FETCH NEXT
from Lê_Número
into @ID_Conjunto, @Número;
while @@fetch_status = 0
begin
-- novo conjunto
set @ID_atual= @ID_conjunto;
set @Bloco= replicate('0', 100);
-- coleciona números do conjunto atual
while @@fetch_status = 0 and @ID_Conjunto = @ID_atual
begin
-- liga bit respectivo do número
set @Bloco= Stuff(@Bloco, @Número+1, 1, '1');
-- obtém próximo número
FETCH NEXT
from Lê_Número
into @ID_Conjunto, @Número;
end;
-- fim números do conjunto atual
-- registra bloco
set @ID_Bloco= (SELECT ID from Tab_Bloco where Bloco = @Bloco);
IF @ID_Bloco is null
begin
-- novo bloco
INSERT into Tab_Bloco (Bloco) values (@Bloco);
set @ID_Bloco= scope_identity();
end;
-- estabele ligação de conjunto e bloco
INSERT into Conjunto_Bloco (ID_Conjunto, ID_Bloco)
values (@ID_atual, @ID_Bloco);
end;
-- fim conjunto
CLOSE Lê_Número;
DEALLOCATE Lê_Número;
go
Missing validate this model with a large volume of data.
Does the order of the elements in the set matter in comparison? That is, the set {1, 12, 4, 6} is equal to or different from {1, 4, 6, 12} ?
– José Diz
Should the sets to be compared have the same number of elements or does the comparison consider the set of fewer elements? That is, {1, 12, 4} equals {1, 12, 4, 6}?
– José Diz
One set must be equal to the other no matter the order and must have the same amount of numbers, that is the set A {6, 12, 4, 1} is equal to the other B {1, 12, 4, 6}.
– Leandro
If, for the comparison of sets, the order in which the elements are in does not matter, what is the "order" column for? Are numerical values always small? For example, they vary from 1 to 50.
– José Diz
is a way that I found as a solution to be able to compare using the above query, but in fact the order does not matter. The numbers vary from 0 to 99.
– Leandro