Best way to eliminate duplicate rows in a MySQL table (best practice)

Dear all,

This is my first project with MySQL and I'm not an expert yet, so forgive me if the question is too basic.

I am migrating data into a MySQL database. The table should hold about 4 to 5 million records and should not grow beyond that, because I will run periodic maintenance to remove expired records.

Another maintenance task to keep the record count down is deleting duplicate records.

For this I am running the following script:

DELETE t1 FROM contacts t1
INNER JOIN contacts t2 
WHERE 
t1.id < t2.id AND 
t1.email = t2.email;

Here contacts is the table and email is the reference field that must not be duplicated.
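As a sanity check before running a delete like this, the number of rows it would remove can be counted first. This is only a minimal sketch: it assumes id is unique and uses just the contacts table and the id and email columns from the script above. The aliases copies and surplus_rows are illustrative names.

```sql
-- Emails that appear more than once, with how many copies each has.
SELECT email, COUNT(*) AS copies
FROM contacts
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY copies DESC;

-- Total rows the DELETE above would remove: every row except one per email.
-- COUNT(email) and COUNT(DISTINCT email) both ignore NULLs, which matches
-- the join condition t1.email = t2.email (it never matches NULL).
SELECT COUNT(email) - COUNT(DISTINCT email) AS surplus_rows
FROM contacts;
```

Comparing surplus_rows with the affected-row count that the DELETE reports gives a quick cross-check that nothing unexpected was removed.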

I tried the script on a test database with 1,000 records and it worked fine. However, when I ran it on the first load of records (I split the data into batches of about 500,000 records each), I had already been waiting 15 minutes for it to return, and the program crashed while I was writing this (I am using HeidiSQL).

So once the database has 5 million records, I believe it will be impossible to run this script.

So I would like to know whether there is another way, or a better practice, for eliminating duplicates.

Thanks for the help! Daniel

  • I answered a similar question the other day with a solution that does not use DELETE with JOIN, which may help you: https://answall.com/a/491093/57220. There are other ideas there too, and some of them should help.

  • @Ricardopunctual first of all, thank you for taking the time to help me. I quickly implemented the first part here (finding the duplicates), which was something else I needed anyway: a sanity check to know how many duplicates there are before deleting anything, so I can compare that with the number actually deleted, just to be safe. It worked, and better still, it was practically instantaneous. I'm going to do the second part calmly and will let you know soon whether it worked. Thanks! :D

  • @Ricardopunctual it gave a timeout error with 1m. I will review the code to see whether something is wrong, but I don't think so.

  • Good. In any case, DELETE usually takes a while :(

  • Hi @Ricardopunctual, good evening. I took a while because this really is time-consuming. I found the problem that was causing the timeout in the first place: as I said in the question, the process I had been running was hung, and I had to kill it to release it. After that I tested 8 solutions I found online for the "best way to delete duplicates", all of which took HOURS to process "only" 500,000 records IN A SINGLE TABLE (the simplest kind of duplicate removal). Yours took EXACTLY 8 s!

  • And it gave the correct count! I had split the data into batches to organize it in Excel, which has a row limit, so I segmented it into batches of 500K. Removing duplicates there is a single click, and the number Excel deleted is EXACTLY the same, which confirms the result. I tested it this way first to check both the time and the result, because in Excel I can do this with 1 batch but not with 2, 3, and so on. I will load the second batch into the database and run it again to see whether the performance holds, but if you want to post the answer already (just the final script), I will accept it right away. Thanks!!!

  • Good, Daniel. I also use Excel for these checks. If the idea helped, you can leave a vote on the other answer and post your solution here with the results, such as the timings. I think that would make the answer more complete :)
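The linked answer is not reproduced on this page, so the following is not necessarily the script that ran in 8 s. It is a sketch of one commonly used alternative to the self-join: work out the id to keep for each email once, in a derived table, and delete every row below it. It assumes id is unique, that the row with the highest id is the one to keep (as in the original script), and that email has no index yet. The names idx_contacts_email, k and keep_id are made up for illustration.

```sql
-- Assumption: email is not indexed yet. An index lets MySQL look up rows
-- by email instead of scanning the whole table for each comparison.
CREATE INDEX idx_contacts_email ON contacts (email);

-- Build the list of duplicated emails and the id to keep once, then join
-- the table against that list and delete the older copies.
DELETE c
FROM contacts AS c
INNER JOIN (
    SELECT email, MAX(id) AS keep_id
    FROM contacts
    WHERE email IS NOT NULL
    GROUP BY email
    HAVING COUNT(*) > 1
) AS k ON k.email = c.email
WHERE c.id < k.keep_id;
```

On a large table it may be worth trying this on a copy first and comparing the affected-row count with the duplicate count obtained beforehand. Once the table is clean, a UNIQUE index on email would stop new duplicates from being inserted.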

No answers
