How to remove duplicate lines efficiently?

Asked

Viewed 32,556 times

11

I’m normalizing a table here and found that there are duplicate lines.

The way I chose to clear these lines is through the:

CREATE TABLE tabela_nova AS ( 
    SELECT DISTINCT * FROM tabela_antiga
);

The problem this way is that it is very slow and depending on the size of the bank, it takes more than a day.

There would be a more efficient way to eliminate duplicate lines?

  • It was not clear what this duplicate in your table, are codes, names... from an example(select) of how this your table with duplicate values

  • @Sneepsninja were imported data. There are 20 columns and 120 million rows. Some 8,000 lines are duplicated, that is, they have exactly the same values for all columns.

  • You have a backup of this data?

  • Read my answer below, just you understand what I posted in the reply.

  • 1

    @Israelsousa, I’ll read your answer, I haven’t had time yet. Quick

  • to have more efficiency and readability in your select do not put "distinct *" but "distict field", field you want to compare

Show 1 more comment

2 answers

20


Let’s take an example with the Mysql database, with a table called nomes, containing the fields id and nome. This syntax can be used in other databases, we will use the following command:

DELETE a FROM nomes AS a, nomes AS b WHERE a.nome=b.nome AND a.id < b.id

Note that in the SQL command after the FROM I call twice the table nomes, but I differ by the letters a and b. You could name it whatever you want.

Note also that after the WHERE I do the comparison between the columns, checking the duplicity and then say that the id of a must be less than b. This way Mysql will compare all records to the same nome and delete those containing the minor id.

  • nomes: It is the table with duplicate records.
  • nome: It is the field for comparison of records.
  • id: It is the primary key of the table.

See in practice how it happens:

Table with duplicate records

id name
1 George
2 George
3 Gaius
4 Gaius
5 Erica
6 George Moura
7 George
8 Erica
9 George
10 Gaius
11 George
12 John

Applying the SQL command described above:

DELETE a FROM nomes AS a, nomes AS b WHERE a.nome=b.nome AND a.id < b.id
id name
6 George Moura
8 Erica
10 Gaius
11 George
12 John

In case you want to delete all duplicate records, leaving only the unique records is just change the < for !=. For example, by running SQL below in the original table, we would have:

DELETE a FROM nomes AS a, nomes AS b WHERE a.nome=b.nome AND a.id != b.id
id name
6 George Moura
12 John

Source

  • Why this way is more efficient than the "SELECT DISTINCT"?

  • I find it simpler to understand and understand what is happening under the covers. besides that it is also very good for when the Ids duplicates are numerical.

  • But she is faster?

  • I believe so, because you can specify the colums that you want to be returned.

  • I’ll give you the right one, but I’m not sure that’s the answer I wanted. But okay. Abs

  • Very Good.. Helped me. There’s only one thing, I had to change my query a little to make it work.. Veleu!

Show 1 more comment

-3

DELETE a FROM tbl_xf AS a,tbl_xf AS b WHERE a.fk_x=b.fk_x AND a.pk_x < b.pk_x AND a.fk_x=8 AND b.fk_x=8

Browser other questions tagged

You are not signed in. Login or sign up in order to post.