How to compare set of numbers with other set?

Asked

Viewed 957 times

3

How to mount query to compare if a set A of numbers is equal to another, of B, with the same numbers of set A, and so on.

Example:

Table data

id  conjunto    numero      ordem
1   1           1           1
2   1           12          2
3   1           4           3
4   1           6           4
5   2           1           1
6   2           12          2
7   2           4           3
8   2           6           4
9   3           15          1
10  3           17          2
11  3           20          3
12  4           15          1
13  4           17          2

Result after comparison: duplicate sets 1 and 2

I have a table with more than one million sets and several numbers for each set and it is taking too long to return result. I couldn’t think of any other way to compare it to using For XML and creating TEMP temporary table. There is another way to compare efficiently and performatively?

Below is a script that makes the comparison:

select
    t.*
INTO 
#TABELA
from
    (
        select 1 id, 1 conjunto, 1 numero, 1 ordem
        union all
        select 2 id, 1 conjunto, 12 numero, 2 ordem
        union all
        select 3 id, 1 conjunto, 4 numero, 3 ordem
        union all
        select 4 id, 1 conjunto, 6 numero, 4 ordem
        union all
        select 5 id, 2 conjunto, 1 numero, 1 ordem
        union all
        select 6 id, 2 conjunto, 12 numero, 2 ordem
        union all
        select 7 id, 2 conjunto, 4 numero, 3 ordem
        union all
        select 8 id, 2 conjunto, 6 numero, 4 ordem
        union all
        select 9 id, 3 conjunto, 15 numero, 1 ordem
        union all
        select 10 id, 3 conjunto, 17 numero, 2 ordem
        union all
        select 11 id, 3 conjunto, 20 numero, 3 ordem
        union all
        select 12 id, 4 conjunto, 15 numero, 1 ordem
        union all
        select 13 id, 4 conjunto, 17 numero, 2 ordem
    ) t   


select DISTINCT
T1.conjunto,
(SELECT RIGHT('' + CONVERT(varchar, numero), 2) as [text()] from #TABELA i where i.conjunto=t1.conjunto order by i.conjunto, i.ordem for xml path('')) numeros
INTO
#TEMP
from 
#TABELA T1



SELECT
t1.conjunto
FROM
(
    select
    t1.numeros numeros
    from 
    #TEMP t1
    group by
    t1.numeros
    having
    count(*) > 1
) t2
inner join #TEMP t1 on t2.numeros = t1.numeros
  • Does the order of the elements in the set matter in comparison? That is, the set {1, 12, 4, 6} is equal to or different from {1, 4, 6, 12} ?

  • Should the sets to be compared have the same number of elements or does the comparison consider the set of fewer elements? That is, {1, 12, 4} equals {1, 12, 4, 6}?

  • One set must be equal to the other no matter the order and must have the same amount of numbers, that is the set A {6, 12, 4, 1} is equal to the other B {1, 12, 4, 6}.

  • If, for the comparison of sets, the order in which the elements are in does not matter, what is the "order" column for? Are numerical values always small? For example, they vary from 1 to 50.

  • is a way that I found as a solution to be able to compare using the above query, but in fact the order does not matter. The numbers vary from 0 to 99.

2 answers

2


Whereas:

  • the order of the elements (numbers) in the set does not matter;
  • there are no repeated elements (that is, a same number cannot occur more than once in the same set); and
  • the values of the numbers vary from 0 to 99,

I suggest you store the sets in bit map form. For each set a 100-position vector is created, where each position represents a number; if the number is present in the set, the position is 1, but if the number is not part of the set, the position is 0.

Assuming that each set can have elements ranging from 1 to 10, to represent the set {4, 1, 9} we would have 1001000010, where the first bit represents the number 1, the second bit represents the number 2 and so on, up to 10.

Considering that repeated sets may exist, the bit map is in a table. That is, each existing element combination occupies a single bit map. In another table, a relation indicating which bit map each set uses.

The initial idea was to implement the bit map with a column declared as bit(100). That is, a bit string. But unfortunately this is not allowed in T-SQL. To circumvent this restriction, the first version then uses a byte string: char(100).

CREATE TABLE Tab_Bloco (
  ID int identity,
  Bloco char(100) not null default replicate('0', 100)
);

The other table contains, for each set, which bit map it uses.

CREATE TABLE Conjunto_Bloco (
  ID_Conjunto int not null,
  ID_Bloco int not null 
           references Tab_Bloco(ID)
);

For example, sets 1:{4, 1, 9}, 2:{1, 2, 3, 5, 7} and 3:{1, 4, 9} would be represented like this:

Tab_Bloco
1 1001000010
2 1110101000

Conjunto_Bloco
1 1
2 2
3 1

Note that sets 1 and 3 share the same bit map.

In principle, if it were possible to use the bit string, each set would only occupy 21 bytes (4 of Id_set, 4 of Tab_block ID, 13 of Block in Tab_block). It is obvious that this account does not include indexes and other controls. But, for now, the proposed solution takes 108 bytes (4 of Id_set, 4 of Tab_block ID, 100 of Block in Tab_block).

There is a way to implement the bit map in 4 variables of type int, since each one occupies 32 bits. Or maybe Binary(13). But it is necessary to use bitwise operations.

As the initial goal is performance, the new storage structure will ensure speed to find repeated sets as well as "such a set exists?".

To validate the new structure it is necessary to know which are the main manipulations with the data. At first, it will not be necessary to store the sets in the old form, but only as bit maps. That is, each new set to be added is stored directly as a bit map. From each vector it is possible to return the numbers that make up each element. But if it is necessary to keep the sets in the two representations (numbers and bit maps), then it is recommended to use procedure Trigger to ensure consistency between the two tables.


To check for repeated sets, it is a very simple process.

-- código #2
with Repetidos as (
SELECT ID_Bloco, Count(*) as Qtd
  from Conjunto_Bloco
  group by ID_Bloco
  having Count(*) > 1
)
SELECT T1.ID_Conjunto, T1.ID_Bloco
  from Conjunto_Bloco as T1
       inner join Repetidos as T2 on T1.ID_Bloco = T2.ID_Bloco
  order by ID_Bloco;

Code #1 below converts the current table to the two new tables. We opted for the use of cursor, for reading the data, because it seemed to me, in this case, more efficient than the use of code set based.

-- código #1
-- cria as duas novas estruturas de dados
CREATE TABLE Tab_Bloco (
  ID int identity,
  Bloco char(100) not null default replicate('0', 100)
);
CREATE clustered index I1_Tab_Bloco on Tab_Bloco (Bloco);
CREATE unique nonclustered index I2_Tab_Bloco on Tab_Bloco (ID);


CREATE TABLE Conjunto_Bloco (
  ID_Conjunto int not null,
  ID_Bloco int not null
               references Tab_Bloco (ID)
);
CREATE clustered index I1_Conjunto_Bloco on Conjunto_Bloco (ID_Conjunto, ID_Bloco);
CREATE unique nonclustered index I2_Conjunto_Bloco on Conjunto_Bloco (ID_Bloco, ID_Conjunto);
go

--
DECLARE Lê_Número CURSOR 
   forward_only
   for SELECT Conjunto, Numero
         from Conjunto order by Conjunto;
declare @ID_Conjunto int, @Número int;
declare @ID_atual int;

declare @ID_Bloco int, @Bloco char(100);

OPEN Lê_Número;

-- obtém primeiro número do primeiro conjunto
FETCH NEXT
  from Lê_Número 
  into @ID_Conjunto, @Número;

while @@fetch_status = 0
   begin
   -- novo conjunto
   set @ID_atual= @ID_conjunto;
   set @Bloco= replicate('0', 100);

   -- coleciona números do conjunto atual
   while @@fetch_status = 0 and @ID_Conjunto = @ID_atual
      begin
      -- liga bit respectivo do número
      set @Bloco= Stuff(@Bloco, @Número+1, 1, '1');

      -- obtém próximo número 
      FETCH NEXT
        from Lê_Número 
        into @ID_Conjunto, @Número;
      end;
      -- fim números do conjunto atual

   -- registra bloco
   set @ID_Bloco= (SELECT ID from Tab_Bloco where Bloco = @Bloco);
   IF @ID_Bloco is null
      begin
      -- novo bloco
      INSERT into Tab_Bloco (Bloco) values (@Bloco);
      set @ID_Bloco= scope_identity();
      end;

   -- estabele ligação de conjunto e bloco
   INSERT into Conjunto_Bloco (ID_Conjunto, ID_Bloco)
     values (@ID_atual, @ID_Bloco);

   end;
   -- fim conjunto

CLOSE Lê_Número;

DEALLOCATE Lê_Número;
go

Missing validate this model with a large volume of data.

  • Using bit maps has gotten considerably faster compared to other forms of comparison using numbers.

  • @Leandro, good news. As T-SQL does not accept bit(n), we had to use char(n). That is, it became a byte map. But I will refine code #1, so implement bit map using bitwise operator. I hope it works!

2

Although I doubt whether performance will improve, another way to compare records is by using INTERSECT. Example:

declare @conjunto int, @conjMin int

select  @conjunto = max(conjunto),
        @conjMin = min(conjunto)
from #TABELA

select  *, num_rows = (select count(*) from #TABELA where conjunto = t.conjunto),
        is_duplicated = 0
into #tempTable
from #TABELA t

while @conjunto >= @conjMin begin
    if((select top 1 1
        from(   SELECT numero, ordem, num_rows
                FROM #tempTable t1
                where t1.conjunto = @conjunto
                INTERSECT
                SELECT numero, ordem, num_rows
                FROM #tempTable t2
                where t2.conjunto <> @conjunto)
            t))>0 begin

            update #tempTable
            set is_duplicated = 1
            where conjunto = @conjunto 
        end

    set @conjunto = @conjunto-1
end

select distinct conjunto from #tempTable where is_duplicated = 1

Browser other questions tagged

You are not signed in. Login or sign up in order to post.