I’m working on a legacy system whose database has about five years of accumulated records and no normalization. Among other things, its purpose is to let users write and publish posts in the same style as Twitter, but without the 140-character limit. Some users are complaining that their posts appear several times. The system had a serious problem at the time the posts were registered, which caused the same content to be inserted several times.
At the application level I was able to fix the problem, but now I have a gigantic posts table, and many of the records have 2, 3, up to 6 copies. The right thing would be to remove the duplicates, but for now what I want is simply to select one post of each.
To illustrate, suppose the posts table had rows like these (the columns and values here are hypothetical):
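    id | user_id | body
    ---+---------+-------------------------------
     1 |      10 | Hello, this is my first post!
     2 |      10 | Hello, this is my first post!
     3 |      22 | A completely different post.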
We note that records 1 and 2 are from the same user, and have exactly the same body.
The only column that can identify with 100% certainty that one post is the same as another is body, which is of type TEXT. Unlike the example (which was just an illustration), this column can easily reach 1000 characters, so using SELECT DISTINCT ON (body) * FROM posts; does not seem like a great idea to me.
So what I thought was to create a STORED PROCEDURE to generate the MD5 of the body of every post and store it in an indexed field. Then I could do the SELECT DISTINCT based on this MD5 hash.
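As a rough sketch of what I have in mind (assuming PostgreSQL, whose built-in md5() function returns the hash as text; the column and index names are mine):

    -- Add a column to hold the hash of each body
    ALTER TABLE posts ADD COLUMN body_hash CHAR(32);

    -- Populate it once for the existing rows
    UPDATE posts SET body_hash = md5(body);

    -- Index it so the deduplication query is fast
    CREATE INDEX idx_posts_body_hash ON posts (body_hash);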
The table would then gain an indexed body_hash column holding the MD5 of each body, and I could do something like:
SELECT DISTINCT ON (body_hash) * FROM posts;
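One caveat with DISTINCT ON (it is PostgreSQL syntax): without an ORDER BY it keeps an arbitrary row from each group, and Postgres requires the ORDER BY to start with the DISTINCT ON expression. So to deterministically keep, say, the copy with the lowest id:

    SELECT DISTINCT ON (body_hash) *
    FROM posts
    ORDER BY body_hash, id;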
I am in doubt because this table has millions of records. Is MD5 good enough for my situation? Or is there something better I can do to select only one record of each?
Do you want to encrypt the field, or to create a unique key for it?
– Marco Souza
Is this column body your primary key? Give more details about the purpose of this.
– Maniero
No, @bigown. It is the database of a Ruby on Rails application, so the primary key is the ID.
– Pedro Vinícius
@Pedrovinícius, at least we already know there is another column that is 100% reliably different. Doesn’t that suit you? MD5 may be useful, depending on the need; the question does not make this clear. Why do you need to know the difference? Will you use this once (or occasionally) to solve the problem, or will it be something that stays in the application?
– Maniero
@bigown, even if the bodies of the posts are equal, the ID will always be different. What I want to do for now is ignore the repeated posts, and to identify the repetition, the only field I can see is the body :/
– Pedro Vinícius
@Pedrovinícius, there is still something odd in the text. Is this column always different? Or can it be different? Do you want to use it to find out which row is different without looking at the other columns?
– Maniero
@Bigown, I’ll go into more detail.
– Pedro Vinícius
@bigown, I improved the description there.
– Pedro Vinícius
So you will do it only once, to fix the past problem?
– Maniero
Simply put, MD5 is not good enough if you want to protect information. If you just want to compare information, then fine, but if you want to protect it, then no. It’s one-way: it encrypts the information but doesn’t decrypt it, so you would have to use other means to get back what was hashed with MD5. I do not advise working with MD5; I’ve had experience with this function and it did not help me much.
– Falion
@bigown, yes, my idea is to generate the MD5 for all rows only once, since at the application level (in the method that creates posts) the problem has already been fixed.
– Pedro Vinícius
@Falion, my case is not about protecting the data. It’s just to generate a more compact version of the text and use it in the SELECT DISTINCT. I think it’s lighter than doing the same operation on top of the body field.
– Pedro Vinícius
If you do not wish to remove the duplicates physically, you can remove them logically. Create a field that acts as a flag (something like deleted or disabled). If the value is true, the record is disabled or deleted. In the query that returns the posts, you ignore the disabled records. To mark the disabled ones, the MD5 strategy is not bad (in SQL Server I use CHECKSUM for these cases); a sketch of the idea is below. Remember that even though it is time-consuming (because of the number of records), you will do it only once, since you have already fixed the system so it no longer duplicates records.
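A minimal sketch in PostgreSQL (the deleted column name and the choice to keep the lowest id per body are assumptions of mine):

    -- Flag column; true means the post is logically removed
    ALTER TABLE posts ADD COLUMN deleted BOOLEAN NOT NULL DEFAULT FALSE;

    -- Mark every copy except the oldest one of each body
    UPDATE posts
    SET deleted = TRUE
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM posts
        GROUP BY md5(body)
    );

    -- The application then filters with WHERE NOT deleted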
– cantoni
I may be talking nonsense, but wouldn’t it be more efficient to check the body and the user ID, and even a time window, such as a 30-second limit, to also detect possible SPAM attempts? Something like the sketch below. I understand that a hash could solve it, but depending on how the hash is generated it will be the same as doing nothing (hypothetically). I’m not sure I understood the case, but I mean that if it is used wrongly, you will only have more work to fix it.
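As a sketch (assuming the table has user_id and created_at columns, which Rails usually generates; the 30-second window is just an example):

    -- Pairs of posts by the same user, with the same body,
    -- created within 30 seconds of each other
    SELECT p1.id AS original_id, p2.id AS duplicate_id
    FROM posts p1
    JOIN posts p2
      ON  p2.user_id = p1.user_id
      AND p2.body    = p1.body
      AND p2.id      > p1.id
      AND p2.created_at - p1.created_at <= INTERVAL '30 seconds';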
– Guilherme Nascimento