Is giving a "SELECT" before an "INSERT" a safe way of not having duplicate records?

36

I have a table with a column codigo, which I cannot set as a primary key. The system receives many notifications; before inserting, it runs a SELECT to check whether that code already exists in the table. If it does not exist, it does an INSERT; if it does, it does an UPDATE.

It turns out that this system receives many notifications via POST, many of which arrive at the same time for the same code.

Often the system, even with the SELECT check first, ends up inserting duplicate codes into the table.

I don't know how MySQL handles this, so I don't know whether a verification SELECT before the INSERT is safe. I assume there is some kind of queue of queries being processed one by one.

$rs = $db->query("SELECT COUNT(1) AS tem_codigo FROM tabela WHERE codigo = $codigo");

if ($rs['tem_codigo'] == 0){
    // here it would do the INSERT
} else {
    // here it would do the UPDATE
}

How could the code above allow two records with the same code to be inserted? Imagine two requests with notifications arriving at the same time, at exactly the same instant.

What is the best way to get around this without using a primary key?

  • 2

Do you have, or can you create, a UNIQUE key on the codigo column?

I can't touch the database structure =/

Have you tried doing this with a stored procedure? Sometimes that solves it, since SQL runs the SELECT faster than the equivalent application code.

Wouldn't it be a matter of checking the code before doing the INSERT, so the routine checks whether it exists and, if it does, skips the INSERT?

  • 1

Yes, there can be duplicate records. Two simultaneous SELECTs will return the same code. Don't think in terms of a "processing queue"; the database doesn't work that way. I suggest studying concurrency, transactions and isolation levels. The best solution would be for the codigo field to be auto-increment, so the database itself guarantees uniqueness.

8 answers

24

One solution would be to lock the table with the LOCK TABLES command inside a transaction. Obviously someone can argue that this is bad for performance. And it is, because you are queuing all requests in a single queue instead of one queue per code, as would be ideal.
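A minimal sketch of that lock-based approach, using the table and column names from the question (the literal code value 123 is just an illustration):

```sql
-- Serialize all writers on the table: every other session's reads/writes
-- on `tabela` wait until UNLOCK TABLES
LOCK TABLES tabela WRITE;

-- The check and the write now run without interference from other sessions
SELECT COUNT(1) INTO @tem FROM tabela WHERE codigo = 123;
-- ... do the INSERT if @tem = 0, otherwise the UPDATE ...

UNLOCK TABLES;
```

Note that LOCK TABLES implicitly commits any open transaction, so in practice it is the table lock itself, not the transaction, that provides the guarantee here.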

A more suitable solution would be to use INSERT ... ON DUPLICATE KEY UPDATE, which performs an UPDATE when the record already exists.

Example:

INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;

However, since you cannot put a PK on the table, this solution ends up being unviable.


Update

An alternative to the second solution that does not involve changing the original table would be to create an auxiliary table containing only the code in question, as a PK. The steps would be:

  1. Ensure the existence of the code in this auxiliary table using INSERT ... ON DUPLICATE KEY UPDATE
  2. Start a transaction
  3. Perform a SELECT ... FOR UPDATE on the auxiliary table to lock the record for that code
  4. Insert or update the definitive table normally, since at that point other requests for the same code are blocked
  5. Commit the transaction

Although this solution seems quite complex, with many concurrent operations it has the advantage of not locking the entire table. As I mentioned, locking the table queues all requests, while locking per code allows different codes to be inserted or updated simultaneously.
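The steps above could look like this in MySQL; the auxiliary table name tabela_codigos, its column type, and the value 123 are assumptions for illustration:

```sql
-- Hypothetical auxiliary table holding only the code, with a PK
CREATE TABLE tabela_codigos (codigo INT PRIMARY KEY);

-- 1. Guarantee the code exists in the auxiliary table
INSERT INTO tabela_codigos (codigo) VALUES (123)
ON DUPLICATE KEY UPDATE codigo = codigo;

-- 2. Start a transaction
START TRANSACTION;

-- 3. Lock only this code's row; concurrent requests for the
--    same code block here until the transaction ends
SELECT codigo FROM tabela_codigos WHERE codigo = 123 FOR UPDATE;

-- 4. Now check and INSERT or UPDATE in the definitive table safely
--    (other sessions wanting codigo = 123 are waiting on the row lock)

-- 5. Commit, releasing the row lock
COMMIT;
```

This requires an InnoDB table, since row-level locking with SELECT ... FOR UPDATE is an InnoDB feature.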


Final remark

Statements like "I can't change it" are a little complicated. If someone is trying to stop you from making changes because of some bureaucracy, then as a good software engineer you should argue that improvements and fixes require changes.

  • 1

From what I understand, given the lack of a unique key, this condition (ON DUPLICATE KEY) would never be triggered...

@J.Hudler Right, I was still completing the answer. That is precisely why I started with the lock solution.

9

The scenario of two operations happening at exactly the same instant is remote, but it can happen.

Since you cannot make changes to the database structure, what I suggest is locking the table before performing any operation, using MySQL's LOCK TABLES:

LOCK TABLES `minhaTabela` WRITE;
/*!40000 ALTER TABLE `minhaTabela` DISABLE KEYS */;
INSERT INTO `minhaTabela` VALUES ...
/*!40000 ALTER TABLE `minhaTabela` ENABLE KEYS */;
UNLOCK TABLES;

This is not very advantageous in terms of performance, since operations on the table will be blocked until the previous request completes. In short, you will be creating a single FIFO queue (first come, first served).


The ideal scenario is to apply extra indexes to the table for performance: since the query filters on the codigo field, that field should be indexed.

On the other hand, you describe that the codigo field cannot repeat, so it should be marked UNIQUE to avoid duplicates.

I don't know whether, when you say you can't touch the table, it is because you can't access its administration. If that is the case, you can always run a query to change that particular field, assigning it UNIQUE:

  • Mark field as UNIQUE

    ALTER IGNORE TABLE minhaTabela ADD UNIQUE (codigo);
    
  • Mark field as index

    ALTER TABLE minhaTabela ADD INDEX (codigo);
    

If you cannot access the table's administration and cannot execute queries that change its structure, I advise you to talk to someone who can; otherwise, as the data in the table grows, the application will become slower and slower.

7

To think a little outside the box:

Is there any restriction requiring the code column to be numerical and sequential? By the way, what is its type? If it is numerical, what is the maximum value it supports?

Sometimes we forget to ask this question and there may be different solutions.

Your problem is avoiding collisions between two records in this column. One way to do this is to record values with a much wider distribution. For example, instead of taking the "next value", store a hash of multiple identifiers, i.e., save the MD5 of (user + timestamp + something else). The probability of collision is much lower, so the stored values end up unique without you having changed anything in the database, only the logic of your application.
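As a sketch, MySQL's built-in MD5(), CONCAT(), UNIX_TIMESTAMP() and RAND() functions could produce such a value at insert time. The table name is the question's; the inputs to the hash ('usuario42') are illustrative, and codigo would need to be a string column (e.g. CHAR(32)) to hold the hex digest:

```sql
-- Store a widely-distributed value instead of a sequential one:
-- MD5 of user + current time + a random number
INSERT INTO tabela (codigo)
VALUES (MD5(CONCAT('usuario42', UNIX_TIMESTAMP(), RAND())));
```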

  • 1

The idea is very good. However, I am in favor of not using the MD5 hash, or any other hash. My reasoning: even with a low probability, hashes can collide. MD5, as in the example given, has a collision probability of 1/(2^128), which is "easy" to happen. The example without the MD5 hash, user + timestamp + something (which could be a rand(time()), for example), is less likely to conflict.

Oops. I agree with you that, in an ideal solution, if a column must not have collisions, the ideal is to set it as UNIQUE, or auto-increment. I only suggested this because he said he can't change the database (and I figured that if he can't change this column, he also can't create another table, etc.)

The code comes from the notification, so I have that limitation as well.

  • 2

"... probability of 1/(2^128), which is 'easy' to happen..." It's not: 2^128 is an absurdly large number, roughly 3.4 × 10^38, so the odds are something like 1 in a billion billion billion billion.

4

When dealing with multiple connections to the same database, with data being selected and inserted all at the same time, you really have to worry about transactions and about when a given set of data will or will not exist.

As has already been said, understanding and using the TRANSACTION ISOLATION LEVEL is very important. In MySQL, InnoDB has several transaction isolation levels: REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED, and SERIALIZABLE.

Transactions

  • Repeatable Read - When the isolation level is REPEATABLE READ, any data read within a transaction stays the same for that transaction. That is, no matter how many times you run a SELECT, it returns exactly the same result, even if another session does an UPDATE on the same data.

  • Read Committed - At this isolation level, each read (SELECT) sees only what has already been COMMITted, i.e., data from transactions that have already finished. You will never see data created inside a transaction before that transaction ends.

  • Read Uncommitted - This level is also known, at least here in the US, as "dirty read". That is because it allows any SELECT to read data that was placed in the table but not yet committed. So if you run the same SELECT twice, you may get different results, because the data is not yet "concrete".

  • Serializable - Almost identical to REPEATABLE READ, SERIALIZABLE is used when you want to be absolutely sure that, within a transaction, the data is always exactly the same. There are other, more technical differences too.

Source: MySQL 5.0 Reference Manual
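In MySQL the level is chosen per session (or per transaction) before the transaction starts; a brief sketch, reusing the question's table and an illustrative code value:

```sql
-- Applies to subsequent transactions in this session
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

START TRANSACTION;
-- This read sees only committed data from other sessions
SELECT COUNT(1) FROM tabela WHERE codigo = 123;
COMMIT;
```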

3

Without changing the database, just changing the dynamic query, it would be something more or less like this:

BEGIN TRAN WITH ISOLATION LEVEL READ COMMITTED

UPDATE table ... WHERE id = @id

IF (@@ROWCOUNT = 0) -- if nothing was updated

    INSERT INTO table ...

COMMIT TRAN

@@ROWCOUNT is an MS SQL Server global variable that returns how many records were affected by the last statement (I don't know if MySQL has anything like it, but the logic would be the same).

The solution is to keep everything within a transaction with a very restrictive isolation level, possibly locking the table to avoid dirty reads from other SELECTs.

Then do the INSERT or UPDATE within that same transaction, which is the only way to ensure that another concurrent command will not read from or write to the same table in between.
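MySQL does in fact have an equivalent of @@ROWCOUNT: the ROW_COUNT() function. A sketch of the same update-then-insert logic in MySQL syntax; the table and codigo column come from the question, while the valor column and the literal values are assumptions:

```sql
START TRANSACTION;

UPDATE tabela SET valor = 'novo' WHERE codigo = 123;

-- ROW_COUNT() returns how many rows the last statement changed
SET @afetadas = ROW_COUNT();

-- If nothing was updated, insert instead
-- (inside a stored procedure this would be a plain IF block)
INSERT INTO tabela (codigo, valor)
SELECT 123, 'novo' FROM DUAL WHERE @afetadas = 0;

COMMIT;
```

One caveat: by default MySQL's ROW_COUNT() counts rows actually changed, not merely matched, so an UPDATE that sets a row to the value it already had reports 0.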

3

The right approach is to use transactions with the correct transaction isolation level to encapsulate the operations.

In your case, when you start the transaction you need to see whether another transaction has changed the data, so you can set the isolation level like this:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED

To start a transaction implicitly with the next query, run:

SET AUTOCOMMIT=0
SELECT ...
INSERT ...

or

UPDATE ...

and finally:

COMMIT
SET AUTOCOMMIT=1

2

I have had this problem.

In cases like yours, where changing the database is not an option, I simulate Oracle sequences in MySQL. For this I create a table tb_seq_sequence, in which I keep the table name and a numerical value, which is the current value.

Then I encapsulate access to this table in a SequenceUtil class, which is responsible for retrieving the current value and then incrementing it.

There is the possibility of two clients accessing this sequence at the same time, but it is very unlikely. I have always used this kind of solution and to this day it has served me very well.

If you want something foolproof, you can use @Zuul's solution and lock the sequence table. That will certainly not cause you any trouble.
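A sketch of that simulated sequence; the tb_seq_sequence table name is the author's, but its column names and the 'tabela' value are assumptions. Combining the read with SELECT ... FOR UPDATE (on an InnoDB table) makes the fetch-and-increment safe even under concurrency:

```sql
CREATE TABLE tb_seq_sequence (
  nome_tabela  VARCHAR(64) PRIMARY KEY,  -- which table this sequence feeds
  valor_atual  INT NOT NULL              -- current sequence value
);

INSERT INTO tb_seq_sequence VALUES ('tabela', 0);

-- Fetch-and-increment: FOR UPDATE locks the row, so two sessions
-- can never read the same current value
START TRANSACTION;
SELECT valor_atual FROM tb_seq_sequence
 WHERE nome_tabela = 'tabela' FOR UPDATE;
UPDATE tb_seq_sequence SET valor_atual = valor_atual + 1
 WHERE nome_tabela = 'tabela';
COMMIT;
```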

1

If you use the SELECT within a transaction, then yes, it is safe.

If you can create a stored procedure, you can pass only the values to it and the procedure takes care of doing what needs to be done, without the database having to return values for you to check in your PHP code.

Simple MySQL stored procedure example:

Creating a small test table and some test data:

CREATE TABLE `cliente` (
  `id` int(11) NOT NULL,
  `nome` varchar(255) NOT NULL
);

INSERT INTO `cliente` (`id`, `nome`) VALUES
(1, 'Alexandre'),
(2, 'Maria'),
(3, 'Pedro'),
(4, 'Paulo'),
(5, 'João'),
(6, 'Zaqueu'),
(7, 'Marcelo'),
(8, 'Marcela');

A procedure that checks with a SELECT whether the ID already exists; if it does not exist it does an INSERT, and if it does exist it does an UPDATE:

DELIMITER //

CREATE PROCEDURE verify_id_and_insert( IN esteid int(11), IN estenome varchar(255) )
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION ROLLBACK;
  START TRANSACTION;

   SET @verificaid = NULL; -- reset first: SELECT ... INTO leaves the variable untouched when no row matches
   SELECT id INTO @verificaid FROM cliente WHERE id=esteid;

  IF (@verificaid IS NULL) THEN
    INSERT INTO cliente(id,nome) VALUES(esteid, estenome);
    COMMIT;
  ELSE
    UPDATE cliente SET nome=estenome WHERE id=esteid;
    COMMIT;
  END IF;
END//

DELIMITER ;

Examples of calling the procedure:

Example that does an INSERT, because no id = 9 exists:

CALL verify_id_and_insert(9,'Penélope');

Example that does an UPDATE, because a client with id = 1, named Alexandre, already exists, so that record is changed:

CALL verify_id_and_insert(1,'Alexandre Alterado'); 

How the cliente table looks after the two calls to the created procedure:

1   Alexandre Alterado
2   Maria
3   Pedro
4   Paulo
5   João
6   Zaqueu
7   Marcelo
8   Marcela
9   Penélope

If you cannot create a procedure, you should start the transaction by running a START TRANSACTION; query, then do what you have to do, and finish with a COMMIT; at the end. If there is an error and you want to undo what you did, run a ROLLBACK; instead.
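That flow, sketched in plain SQL using the cliente table from the example above (the new name value is illustrative):

```sql
START TRANSACTION;

-- do what you have to do
UPDATE cliente SET nome = 'Novo Nome' WHERE id = 1;

-- on success, make the changes permanent:
COMMIT;

-- or, if something went wrong, undo everything instead:
-- ROLLBACK;
```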
