Optimize a data transfer from one table to another in SQL Server

I have a SQL Server database with two tables, one with 2 billion records and the other with 16 billion. I am copying the first into the second.

Using the

INSERT INTO table2
SELECT * FROM table1;

doesn't work, because the log it generates fills up my hard drive.

So I wrote a SQL block with the following logic:

/*** SQL to merge the contents of the lote4 and carga tables ***/
USE [bd_contagem]
GO
BEGIN
-- Inserting the information into the temporary table

DECLARE @lote_XXX float
DECLARE @numero_contrato nvarchar(255) 
DECLARE @codigo_equipamento nvarchar(255) 
DECLARE @sentido nvarchar(255)
DECLARE @faixa float
DECLARE @data_contagem date
DECLARE @hora time
DECLARE @classe float
DECLARE @veloc float
DECLARE @v_cont float = 0
declare @vcont_qte_inseridas float
declare @dia int
declare @mes int
declare @ano int
declare @dia_semana int
declare @id_equipamento int

DECLARE crLerLote4 CURSOR FOR
SELECT  '4','TT-954/2010-00'
      ,[codigo_equipamento]
      ,SUBSTRING( [faixa],3,1) AS SENTIDO
      ,SUBSTRING( [faixa],5,1) AS FAIXA
      ,[data_contagem]
      ,[hora]
      ,[classe]
      ,[veloc]
      ,[id_equipamento]
      ,[dia]
      ,[mes]
      ,[ano]
      ,[dia_semana]
  FROM [bd_contagem].[dbo].[tblVMDcargaLote4]

OPEN crLerLote4

FETCH NEXT FROM crLerLote4 INTO
        @lote_XXX,  @numero_contrato,@codigo_equipamento,@sentido,  @faixa,
        @data_contagem,@hora,@classe,@veloc,@id_equipamento,@dia,@mes,@ano,
        @dia_semana

WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRANSACTION
            INSERT INTO [bd_contagem].[dbo].tblVMDcarga
                        (lote_XXX
                        ,numero_contrato
                        ,codigo_equipamento
                        ,sentido
                        ,faixa
                        ,data_contagem
                        ,hora
                        ,classe
                        ,veloc
                        ,id_equipamento
                        ,dia
                        ,mes
                        ,ano
                        ,dia_semana)
                 VALUES
                       (@lote_XXX,
                        @numero_contrato,
                        @codigo_equipamento, 
                        @sentido,
                        @faixa,
                        @data_contagem,
                        @hora,
                        @classe,
                        @veloc,
                        @id_equipamento,
                        @dia,
                        @mes,
                        @ano,
                        @dia_semana)

            delete [bd_contagem].[dbo].[tblVMDcargaLote4]
              where codigo_equipamento = @codigo_equipamento
                  and data_contagem = @data_contagem
                  and hora = @hora
                  and classe = @classe
                  and veloc = @veloc
                  and id_equipamento = @id_equipamento


        FETCH NEXT FROM crLerLote4 INTO
            @lote_XXX,  @numero_contrato,@codigo_equipamento,@sentido,  @faixa,
            @data_contagem,@hora,@classe,@veloc,@id_equipamento,@dia,@mes,@ano,
            @dia_semana

        --set @v_cont = @v_cont + 1  
        --print @v_cont
        IF (@@ERROR = 0)
        BEGIN
            COMMIT TRANSACTION
        END
        ELSE
        BEGIN
            ROLLBACK TRANSACTION
        END
    END
    CLOSE crLerLote4
    DEALLOCATE crLerLote4
END

This process is copying on average 15,000 records per second.

If I run the same block twice, I realize that the performance does not improve.

I need to improve this performance, and I would like to know if there are any suggestions for a better solution.

  • And on the hardware side, would there be no improvement, for example, from increasing its processing capacity?

  • I advise you not to use a cursor for this; it is very slow. The ideal would be for you to do an INSERT ... SELECT, which would be much faster.

  • You have an interesting problem on your hands. I would say a CURSOR is a bad choice for your case, because there are a lot of records. What you can do is disable transaction log generation, do what needs to be done, and turn it back on.

  • Just one important detail: using a CURSOR does not prevent the log from being generated. The hard drive only hasn't filled up yet because the CURSOR is orders of magnitude slower than the INSERT ... SELECT. Therefore, if you let the CURSOR run long enough your hard drive will fill up as well.

  • @Cirineulacerdabraga: What is the recovery model? // What is the version of SQL Server? // What is the primary key of the table [bd_contagem].[dbo].[tblVMDcargaLote4]? // Can the database be placed in single-user mode while the row transfer is performed? // Are all rows in table tblVMDcargaLote4 from the same lot/contract?

  • @Cirineulacerdabraga: Be careful with the use of float variables in this case. I am referring to the variables @lote_XXX, @sentido and @faixa, which receive 1 character as content. Why not declare them as char(1) or, better yet, exactly as declared in the destination table (tblVMDcarga)?

  • @Cirineulacerdabraga: To reduce the size of the log file, I suggest you remove the DELETE on tblVMDcargaLote4, leaving only the copy part (INSERT INTO tblVMDcarga). // To delete the contents of table tblVMDcargaLote4, it is more efficient (in terms of log generated) to run TRUNCATE TABLE tblVMDcargaLote4.

  • @Josédiz With TRUNCATE he cannot use a WHERE clause as he does with DELETE. The best option is to disable log generation and do an INSERT ... SELECT or maybe, though I have no way to test it with this amount of data, use BULK INSERT.


2 answers


Cirineu, from what you report, the problem is that during the execution of the first T-SQL code shown, all the available disk space is used up even before the execution of the code ends, aborting the process of transferring rows between the tables. And you point to the growth of the transaction log as the possible cause.

Judging by the names of the columns present in tblVMDcarga, as well as by the table name, it seems to me that it is a load (ETL) table for a fact table of a data warehouse, probably a DNIT traffic count.

Additional information is needed to give a specific opinion; as it has not yet been provided, here are some general comments.


Recovery models

As per the SQL Server documentation, "recovery models are designed to control transaction log maintenance". Reading the documentation shows how the choice of recovery model directly impacts the volume of log generated.
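
For illustration, a minimal sketch of how to check the current recovery model and temporarily switch it around a bulk load (the database name is the one from the question; whether switching models is acceptable depends on your backup strategy):

-- check the current recovery model
SELECT name, recovery_model_desc
  from sys.databases
  where name = 'bd_contagem';

-- switch to BULK_LOGGED for the duration of the load, then back to FULL
ALTER DATABASE [bd_contagem] SET RECOVERY BULK_LOGGED;
-- ... run the load ...
ALTER DATABASE [bd_contagem] SET RECOVERY FULL;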


minimal logging

Regarding the transaction log, there is an approach called minimal logging, which is described in several SQL Server documents.


There are some requirements for success with this approach, which are described in those documents. Note that the INSERT ... SELECT construction is also considered a bulk import operation.
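
A minimal sketch of a bulk copy written to qualify for minimal logging (the table and column names are illustrative; it assumes the BULK_LOGGED or SIMPLE recovery model, the TABLOCK hint on the target and a target that is a heap or an empty B-tree):

-- sketch: INSERT ... SELECT eligible for minimal logging (illustrative names)
INSERT into dbo.destino with (TABLOCK) (id, valor)
    SELECT id, valor
      from dbo.origem;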


tempDB

It is necessary to pay attention to the configuration of the tempdb database, so that it does not become a bottleneck. I suggest reading the article How to Configure Tempdb for Microsoft SQL Server.
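
As an illustration only (the logical file names, path and sizes below are assumptions; adjust them to your server), tempdb is usually spread over several equally sized data files:

-- sketch: resize the existing tempdb data file and add a second one of the same size
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 512MB);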


Parallel execution

Another factor that must be analyzed carefully is how the option max degree of parallelism is configured at the instance level.
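
To inspect and, if necessary, adjust this setting, something like the sketch below can be used (the value 4 is purely illustrative):

-- sketch: check and adjust the instance-wide MAXDOP
EXECUTE sp_configure 'show advanced options', 1;
RECONFIGURE;
EXECUTE sp_configure 'max degree of parallelism';      -- shows the current value
EXECUTE sp_configure 'max degree of parallelism', 4;   -- illustrative value
RECONFIGURE;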


VLT (Very Large Tables)

In the case of VLT tables, one of the options used to transfer rows from one table to another is to copy them in blocks; that is, to transfer one chunk at a time. The definition of the content/size of each block depends on the context.

To use block transfer efficiently, the source table must have a unique identification (unique key) so you can tell what has already been transferred. In the case of fact tables, a surrogate key is usually used. For the sake of performance, it is recommended that there be an index on that unique identification, preferably clustered.

Suppose you want to transfer millions of rows from the stageDADOS table to the DADOS table, and the primary key of the stageDADOS table is the idEvento column. You can implement block transfer with something similar to:

-- code #1
INSERT into DADOS (colunas)
    SELECT colunas
      from stageDADOS
      where idEvento between ....

At each execution of the above code, a block of rows is transferred. You must keep track of which was the last row transferred at each run, so that the next run starts from the first row not yet transferred.

That is, with the above approach it is possible to transfer block by block and monitor the growth of the log file.

An example implementation of this approach follows.

-- code #2
CREATE TABLE stageDADOS (
   idEvento int identity primary key, 
   dataEvento date not null,
   infoEvento varchar(200) not null
);

CREATE TABLE DADOS (
   idEvento int not null, 
   dataEvento date not null,
   infoEvento varchar(200) not null
);
go

Note that the DADOS table is a heap.

To control what has already been transferred, the ControleTransf table is created in the database. It allows you to keep track of what has already been transferred, as well as produce statistics about the transfer process.

-- code #3
CREATE TABLE ControleTransf (
   idBloco int identity, 
   inicioTransf smalldatetime not null, 
   fimTransf smalldatetime default cast(current_timestamp as smalldatetime),
   nLinhas int,
   chaveInicial int, 
   chaveFinal int
);
go

And finally, the procedure that carries out the transfer of a block of rows.

-- code #4
CREATE PROCEDURE TransfBloco @tamBloco int as
begin
declare @ultimaChave int, @chaveInicial int, @chaveFinal int;
declare @inicioTransf smalldatetime, @nLinhas int;

-- look up the last transferred key
SELECT top (1) @ultimaChave= chaveFinal
  from ControleTransf
  order by idBloco desc;
set @ultimaChave= coalesce(@ultimaChave, 0);

-- set the block's initial and final keys
set @chaveInicial= @ultimaChave + 1;
set @chaveFinal= @ultimaChave + @tamBloco;

-- 
set @inicioTransf= cast(current_timestamp as smalldatetime);

--
BEGIN TRANSACTION;

BEGIN TRY

-- transfer the block
INSERT into DADOS with (TABLOCK) (idEvento, dataEvento, infoEvento) 
    SELECT idEvento, dataEvento, infoEvento
      from stageDADOS
      where idEvento >= @chaveInicial
            and idEvento <= @chaveFinal;

-- 
set @nLinhas= @@rowcount;

-- record the transferred block
IF @nLinhas > 0
  begin
  INSERT into ControleTransf (chaveInicial, chaveFinal, inicioTransf, nLinhas)
         values (@chaveInicial, @chaveFinal, @inicioTransf, @nLinhas);
  SELECT 'Transferidas ' + cast(@nLinhas as varchar(15)) + ' linhas.' as Status;
  end
else
  SELECT 'Nenhuma linha transferida.' as Status;    

-- end the explicit transaction
COMMIT; 

END TRY

BEGIN CATCH
SELECT 'erro na transferência' as Status;
ROLLBACK TRANSACTION;
END CATCH;
end;
go


Execution is simple. It is assumed here that the database recovery model is set to bulk-logged. To transfer 100 million rows, simply call the TransfBloco procedure, passing the number of rows as a parameter.

-- code #5 v2
-- transfer one block of rows
EXECUTE dbo.TransfBloco 100000000
-- monitor physical space usage
EXECUTE sp_spaceused 'DADOS';
DBCC SQLPERF (logspace);
go

I suggest starting with a smaller block (10 million rows, for example). You will then have information about performance, time and physical space allocation for that block size, and you can then define the optimal block size for each run.
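
A possible sketch for running several smaller blocks in sequence while watching the log (the block size and the number of iterations are illustrative):

-- sketch: transfer ten blocks of 10 million rows, monitoring the log between blocks
DECLARE @i int = 0;
WHILE @i < 10
BEGIN
    EXECUTE dbo.TransfBloco 10000000;
    DBCC SQLPERF (logspace);
    SET @i += 1;
END;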

The above approach can be modified if the primary key column is, for example, of the type uniqueidentifier.


RBAR

RBAR is an abbreviation of the expression Row-By-Agonizing-Row, coined by Jeff Moden for code that processes one row at a time. Whenever possible we should write code that handles sets of rows (set-based), which SQL Server processes more efficiently. However, there are (a few) situations where the use of a cursor is more efficient than set-based constructions.

In the second piece of code provided, with the use of a cursor, we see an RBAR approach, and in this case it seems extremely inefficient to me. In addition to transferring rows one by one from the source table to the destination table, there is the aggravating factor that the content of the source table is also deleted row by row. That is, it can grow the log file even faster (see recovery models above).


Optimizing the code with a cursor

(1) In terms of the log file, I consider it more efficient to clear the source table manually at the end (once the transfer is considered correct), either with the DROP TABLE command (if the source table is no longer needed) or the TRUNCATE TABLE command.

Deleting rows one by one from the source table does not mean that the physical space will be immediately released by the DBMS.

But if you think you need to delete row by row, note that you can delete the cursor's current row without making the DELETE statement perform a new search (which costs more processing time). Just use CURRENT OF:

-- code #6
DELETE [bd_contagem].[dbo].[tblVMDcargaLote4]
  where current of crLerLote4;

(2) When declaring a cursor, there are options that can, in theory, optimize its execution. A careful reading of the documentation can indicate which options to use.
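
For example, something along these lines (a sketch only; note that a FAST_FORWARD cursor is read-only, so it cannot be combined with DELETE ... WHERE CURRENT OF):

-- sketch: cursor options that tend to reduce overhead in a forward-only, read-only scan
DECLARE crLerLote4 CURSOR LOCAL FAST_FORWARD FOR
SELECT codigo_equipamento, data_contagem, hora, classe, veloc, id_equipamento
  from [bd_contagem].[dbo].[tblVMDcargaLote4];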

  • Thank you José, all your comments were pertinent. Since I only need to do this once, Reginaldo's solution will be more practical for me.


Here’s a possibility:

Create a bat file with the following content.

Change the variables for your environment.

@ECHO OFF
SET banco=DATABASENAME
SET pasta_destino=C:\ARQUIVOS
SET servidor=SERVERNAME
SET instance=SQLEXPRESS2008r2
SET username=USERNAME
SET password=secreta


bcp "SELECT * FROM %banco%..TABLE1" queryout %pasta_destino%\table1.csv -S%servidor%\%instance% -U%username% -P%password% -t; -c

This will generate the file table1.csv in the folder C:\ARQUIVOS.

Now run the following command in SQL Server to import the records into table 2:

BULK INSERT dbo.TABLE2
FROM 'C:\ARQUIVOS\table1.csv'
WITH
(
     FIELDTERMINATOR = ';',
     ROWTERMINATOR = '\n'
)
GO
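
If the destination table is a heap, adding TABLOCK and a BATCHSIZE to the BULK INSERT often helps with very large loads (a sketch only; the batch size is illustrative):

BULK INSERT dbo.TABLE2
FROM 'C:\ARQUIVOS\table1.csv'
WITH
(
     FIELDTERMINATOR = ';',
     ROWTERMINATOR = '\n',
     TABLOCK,               -- allows minimal logging when the target is a heap
     BATCHSIZE = 1000000    -- commits every 1 million rows (illustrative)
)
GO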
  • Thank you Reginaldo. It ran in less than half the time I was getting with my cursor.

  • I already knew that this process with bcp and BULK INSERT would be fast, but I had never tested it with this number of records. Could you tell us how long each process took, and accept the answer? hehe
