Is MD5 good enough?

I’m working on a legacy system whose database has about five years of accumulated records and no normalization. Among other things, its purpose is to let users write and publish posts in the same style as Twitter, but without the 140-character limit. Some users are complaining that their posts appear several times. The system had a serious problem at the time posts were saved, which caused the same content to be inserted several times.

At the system level I was able to correct the problem, but now I have a gigantic posts table, and many of the records have 2, 3, even 6 copies. The right thing would be to remove the duplicates, but for now what I want is simply to select one copy of each post.

To illustrate:

Currently it looks like this: [image of the posts table]

Note that records 1 and 2 are from the same user and have exactly the same body.

The only column that can tell with 100% certainty that one post is the same as another is corpo (the body), which is of type TEXT. Unlike the example (which was just an illustration), this column can easily reach 1000 characters, so I don’t think using SELECT DISTINCT ON (body) * FROM posts; is a great idea.

So what I thought of was to create a STORED PROCEDURE to generate the MD5 of the body of every post and store it in an indexed field. Then I could do the SELECT DISTINCT based on this MD5 hash.

The table would then look like this:

[Image: the table after the change]

With the body_hash being an indexed field, I could do something like

SELECT DISTINCT ON(body_hash) * FROM posts;
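
The preparation step would be roughly this (just a sketch; I’m assuming PostgreSQL, which the DISTINCT ON syntax suggests, the posts / body / body_hash names from the example, and a plain UPDATE in place of the stored procedure):

ALTER TABLE posts ADD COLUMN body_hash CHAR(32);        -- md5() produces 32 hexadecimal characters
UPDATE posts SET body_hash = md5(body);                 -- fill in the hash for the existing rows
CREATE INDEX idx_posts_body_hash ON posts (body_hash);  -- index the new column for the DISTINCT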

I am in doubt because this table has millions of records. Is MD5 good enough for my situation? Or is there something better I can do to select only one record of each?

  • Do you want to encrypt the field or create a unique key for it?

  • Is this corpo column your primary key? Give more details about the purpose of this.

  • No, @bigown. It is the database of a Ruby on Rails application, so the primary key is the ID.

  • @Pedrovinícius then at least we already know there is another column that is 100% reliably different. Doesn’t that work for you? MD5 may be useful depending on the need; the question doesn’t make this clear. Why do you need to detect the duplicates? Will you use this once (or occasionally) to fix the existing problem, or will it be something that stays in the application?

  • @bigown, even if the bodies of the posts are equal, the ID will always be different. What I want to do for now is ignore the repeated posts, and to identify the repetition the only field I can see is the body :/

  • @Pedrovinícius there is still something odd in the text. Is this column always different, or can it merely sometimes be different? Do you want to use it to find out which rows are distinct without looking at the other columns?

  • @Bigown, I’ll go into more detail.

  • @bigown, I improved the description there.

  • So you will do it only once, to fix the past problem?

  • Simple: MD5 is not good enough if you want to protect information. If you just want to compare information, then fine; but if you want to protect the information, then no. It’s one-way: it hashes the information but cannot be reversed, so you would need other means to recover the original from an MD5 hash. I do not advise working with MD5; I’ve had experience with this function and it did not help me much.

  • @bigown, yes, my idea is to generate the MD5 for all rows only once, since at the system level (in the method that creates posts) the problem has already been fixed.

  • @Falion, my case is not about protecting the data. It’s just about generating a more compact version of the text to use in the SELECT DISTINCT. I think that is lighter than doing the same operation on the full body field.

  • If you do not wish to remove the duplicates physically, you can remove them logically. Create a field that acts as a flag (something like deleted or disabled). If the value is true, the record is disabled or deleted. In the query that returns the posts, you ignore the disabled records (a rough sketch of this appears after the comments). To mark the disabled ones, the MD5 strategy is not bad (in SQL Server I use CHECKSUM for these cases). Remember that even though it is time-consuming (because of the number of records), you will do it only once, since you have already fixed the system so it no longer duplicates records.

  • I may be talking nonsense, but wouldn’t it be more efficient to check the body and the ID, and even a time window like a 30-second limit, to also detect possible SPAM attempts? I understand that a hash could solve it, but depending on how the hash is generated it may be the same as doing nothing (hypothetically). I’m not sure I understood the case, but my point is that if it’s used wrong you will only have more work to fix it.

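A rough sketch of the logical-delete idea from the comment above (assuming PostgreSQL; the disabled and user_id column names are illustrative):

ALTER TABLE posts ADD COLUMN disabled BOOLEAN NOT NULL DEFAULT FALSE;
UPDATE posts p SET disabled = TRUE
WHERE EXISTS (SELECT 1 FROM posts p2
              WHERE p2.user_id = p.user_id
                AND p2.body    = p.body
                AND p2.id      < p.id);     -- flag every copy except the oldest one
SELECT * FROM posts WHERE NOT disabled;     -- queries that list posts ignore the flagged rows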

2 answers

3

Reliability of MD5 as a unique identifier

If you want to rely 100% on MD5 to determine that two contents are the same, know that you can’t. MD5 can tell you with certainty that two contents are different: it is reliable when the hashes differ. But two MD5 hashes can be the same for different contents (a collision).

I would say MD5 has well over a 99.99999999% chance of working, but not 100%. For a table full of problems, either it will solve everything or, in a very exceptional situation, it will miss something, which is already a significant gain. But if it’s just to fix the problem once and won’t be useful for anything else, the cost of generating the MD5 doesn’t even pay for the effort. Generating an MD5 hash is "expensive".

Real solution

Index the text column itself, solve the problem, and then drop that index. It is easier and more reliable.

You will still get false positives with this. What if the user wanted to post the same message more than once? A legitimate message would be erased. And I’m assuming the user identifier is also included in the DISTINCT.
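
A sketch of that approach (assuming PostgreSQL and illustrative posts / user_id / body names; adjust to the real schema):

CREATE INDEX idx_posts_dedup ON posts (user_id, body);   -- temporary index on the raw text
SELECT DISTINCT ON (user_id, body) *
FROM posts
ORDER BY user_id, body, id;                              -- keeps the row with the smallest id per (user, body)
DROP INDEX idx_posts_dedup;                              -- drop the index once the cleanup is done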

If new duplicates are to be avoided

This is more hypothetical, it doesn’t seem to be the case.

If you want to prevent new posts from being duplicated, you can even do this to buy time. But if you find a repeated hash, you have to check the full text to make sure there was no collision. In practice this verification will almost never trigger: nearly 100% of the time the SELECT will return no matching rows and you won’t need to do anything else, and even when something does come back, checking those few rows is quick; if everything is fine you just keep one row. So it can be efficient. I still wouldn’t use it like this.
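
Sketching that check at insert time (with the body_hash column from the question; the literal text is just a placeholder for the new post’s body):

SELECT id
FROM posts
WHERE body_hash = md5('text of the new post')   -- cheap lookup through the hash index
  AND body      = 'text of the new post';       -- confirm the full text to rule out a collision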

Actually, using the text field directly will be fairly efficient in processing terms as well; the cost is in the space the index occupies and the time spent accessing it. It can hurt the cache. That is the real problem of using a very large column.

But understand that any insert or update of this column will be considerably slower because of the MD5 computation. That doesn’t mean it’s a problem; run a test to see whether you can afford this cost.

There is a trick that can help, though I don’t know if it applies here, and in any case it won’t give you 100%: index the hash together with a short excerpt of the text. It may be the first character(s), or another passage that is more likely to differ, but it needs to be an excerpt that is guaranteed to exist in every text to be useful. It is much rarer for texts with identical excerpts to collide. This only improves the odds, and it increases space consumption, so its usefulness is questionable.
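
One way to express that trick (a sketch; the 20-character prefix is an arbitrary choice):

CREATE INDEX idx_posts_hash_prefix ON posts (body_hash, left(body, 20));   -- hash plus a short excerpt
SELECT DISTINCT ON (body_hash, left(body, 20)) * FROM posts;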

2


MD5 (Message-Digest Algorithm 5) is a hash algorithm.

Hash algorithms aim to maintain data integrity and work more or less like this:

You have an input M, a hash function H, and an output (hash code) H(M).

M->H->H(M)

A hash algorithm is considered secure if different inputs result in different outputs, i.e., if you change even one bit of the input, the output should be totally different.
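
For example, in PostgreSQL (assumed here only because the question’s DISTINCT ON syntax suggests it), the built-in md5() function shows this behavior:

SELECT md5('some post text');    -- 32 hexadecimal characters (a 128-bit hash)
SELECT md5('some post text!');   -- one changed character yields a completely different hash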

The MD5 algorithm was considered secure some years ago, but today there are better algorithms.

The weak point of MD5 is the generated hash output, which is 128 bits: because it is small, it can produce collisions, that is, different inputs generating equal outputs, which breaks the security of the algorithm.

What is the alternative to MD5?

There are several alternatives, but one of the best known is the SHA-2 family of algorithms.

It’s safer than MD5. The generated hash code (output) has a larger range: it can be 224, 256, 512, etc. bits, which makes it much harder for different inputs to generate equal outputs.
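
In PostgreSQL this could be done with the pgcrypto extension (an assumption: that the extension is available and that the hash column is widened to 64 characters to hold the hex SHA-256 output):

CREATE EXTENSION IF NOT EXISTS pgcrypto;                              -- provides digest()
UPDATE posts SET body_hash = encode(digest(body, 'sha256'), 'hex');   -- 64 hex characters instead of 32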

Conclusion

I think SHA-2 might be a solution to your problem.
