Is it efficient to check file modifications by Hash?

Setting

I need to implement a file-change check between two points in my application.*¹

  • Point 1 - Server - a folder containing product images;
  • Point 2 - Mobile Device - a catalog app that downloads these images from the Server to a specific folder on its SD card;

Problem

I would like, from time to time, to compare the images on the Device with the images on the Server, detect whether any of them have been modified, and re-download the ones that have changed.

Requirements

  • Synchronization happens over the internet, so the amount of data sent across the network must be taken into account;

Technologies

The technology I’m using is as follows:

  • The Mobile App is on Android;
  • The Webservice that checks and returns images to the application is written in C# (MVC Web API);

Question

One of the options I’ve found is comparing hashes. So I wonder: is generating the hash of the file on the Device and comparing it with the hash of the file on the Server efficient for this case? Or is there a better, more efficient option? (Keep in mind that the Server may receive several simultaneous hash-generation requests - is this a lightweight operation for the Server?)

*¹ - the only changes that matter are the ones applied in the Server folder.

Note: when I say "efficiency" I mean both reliability (I accept the 99.999% confidence of the hash, as quoted by @Miguelangelo in the comments) and performance (time and resources, whether in processing or in network traffic).
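For reference, a minimal sketch of the device-side hash computation in Java (the app is Android). The class and method names here are illustrative, not from the question; for a real file you would stream its content into the digest in chunks rather than loading it all into memory:

```java
import java.security.MessageDigest;

public class FileHash {
    // Compute the MD5 digest of a byte array and return it as lowercase hex.
    // For a real file, feed the digest in chunks (e.g. via DigestInputStream)
    // instead of reading the whole file into memory.
    static String md5Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("hello".getBytes("UTF-8")));
        // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

The server would compute the same digest with C#'s `System.Security.Cryptography.MD5`; as long as both sides hash the same bytes, the hex strings are directly comparable.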

  • I don’t know if it fits your scenario, but wouldn’t it be better to associate each file with a versioning ID? For example: ~/Docs/test.png is at version 14 on the server and version 13 on the mobile device (update required). Are the files managed by the application?

  • Hash algorithms are efficient and, in general, "light" - in quotes because the definition of light is subjective. As far as I know, many applications do exactly this. There are also checksum algorithms that are routinely applied to files to verify that they have not changed (among other things).

  • @Leonardobosquett and what if a user changes the versioning ID on purpose?

  • Two identical hashes do not guarantee that the files are equal; for that, all bytes would have to be compared. You could, however, rely on the low collision probability of some hash algorithms, such as MD5, and take it as "right" (99.999% chance) that equal hashes indicate equal files.

  • @Leonardobosquett, the problem is that the folder and the images can be modified freely; a file may be deleted and replaced by another with the same name, which makes that kind of control unreliable and vulnerable. To illustrate: until now I was doing this check using the files’ creation and modification dates, but that fell apart when the original file was deleted and an older one was copied into its place.

  • Understood, versioning does not apply. With hashes it is as Miguel said: you only need to be careful with collisions. Another suggestion is the FileSystemWatcher class, which monitors modifications in a directory; with it you could go back to your old idea, because you would know the modification date relative to the folder.

  • @Miguelangelo, I understand your point, but despite the possibility, the probability of that happening is really very remote; to eliminate it entirely, just compare all the bytes, hehe.

  • @Leonardobosquett, I have even run some tests with FileSystemWatcher and presented it as an option here at the company. But since we are a third party, the idea of maintaining a service fell through: a user can simply stop the service, all the control is lost, and it might not be noticed before bigger problems appear. (Users with privileges are a pain.)

  • @Fernando, just one point: I once did something similar comparing byte arrays. Would that be efficient in your case? It was in mine.

  • @Harrypotter, do you mean sending all the bytes of each image to the server and checking whether they match? If so, it is impractical, because the traffic would be very large: think of 100 images of 200 KB each; every cycle would be 20,000 KB (~20 MB), plus the return of the images that differ. At that point it would be more efficient to simply re-download everything each cycle, without any verification.

  • @Fernando understood, it is complicated indeed. I would use a database, where every altered image triggers an update to the device. From what I understand it is only from the server to the device...

  • Wouldn’t checking the file size (client/server) solve the problem?

  • @Paulohdsousa, it would not be very reliable, since a file could easily be edited and keep the same size, right?

  • You can have a look here http://msdn.microsoft.com/en-us/library/ms379571(v=vs.80).aspx and search for the section 'Collision Resolution in the Hashtable Class'; maybe it will help you.

  • @Fernando I currently use CRC32 plus size, date, and time of filesystem modification - it is an in-house file-synchronization system between machines.


2 answers



A hash basically serves to confirm the integrity of a data sequence. There are several hashing algorithms.

  • Collision: different files, same hash

    It happens, usually with large files, rarely with small ones. But it depends entirely on the hash algorithm you use.

  • Lightweight?

    It depends on which hash algorithm you use. CRC32 is usually quite fast, but collisions are more frequent.

  • Solution

If you are going to work with large files, I recommend SHA1 or MD5, which are neither too heavy nor too light. If you are working with small files, between 1 kB and 10 MB, use CRC32; it has considerable performance for a server.

It is worth noting that these are only pointers, based on my own experience, as an introduction to your question. I recommend you run tests yourself, compare the results, and choose the best option for your case.
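As a concrete illustration of the CRC32 option mentioned above, a minimal Java sketch (class and method names are mine, not from the answer); `java.util.zip.CRC32` is the standard-library implementation:

```java
import java.util.zip.CRC32;

public class Crc32Check {
    // CRC32 is a 32-bit checksum: very fast to compute, but with only
    // 2^32 possible values, collisions are far more likely than with MD5/SHA1.
    static long crc32Of(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Long.toHexString(crc32Of("hello".getBytes("UTF-8"))));
        // prints 3610a686
    }
}
```

For large files you would call `update` repeatedly on buffered chunks rather than on one big array.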

  • Yes. A different hash means different content, but an equal hash does not guarantee equal content.

  • Very good answer, @Metalus, but I don’t know if you can clarify this too: running some tests now, I could not notice any performance (time) difference between generating an MD5 hash of a short string (some 5 characters) and of a 10,000-character string. Is there no performance difference in hash generation relative to the input size? Or do I have to scale the tests up?

    10,000 characters is still small. Try comparing MD5 and CRC32 on a 60 MB file. Of course, the processor and your code also influence the result. Use the .NET Stopwatch for a better comparison.
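A rough sketch of the comparison this comment suggests, in Java, where `System.nanoTime` plays the role of .NET's Stopwatch (16 MB of zeros stands in for a file here just to keep the run light; the sizes and names are my assumptions):

```java
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class HashTiming {
    // Time MD5 and CRC32 over the same buffer; returns {md5Ms, crc32Ms}.
    static long[] timeBoth(byte[] data) throws Exception {
        long t0 = System.nanoTime();
        MessageDigest.getInstance("MD5").digest(data);
        long md5Ms = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        CRC32 crc = new CRC32();
        crc.update(data);
        long crcMs = (System.nanoTime() - t0) / 1_000_000;

        return new long[] { md5Ms, crcMs };
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[16 * 1024 * 1024]; // 16 MB of zeros as a stand-in file
        long[] t = timeBoth(data);
        System.out.println("MD5: " + t[0] + " ms, CRC32: " + t[1] + " ms");
    }
}
```

Absolute numbers vary by machine; the point is that the gap between the algorithms only becomes visible at input sizes in the megabytes.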


First we need to understand a few things: how many images will you have in the folders? What file sizes? Depending on where the calculation is done, computing hashes can make the server or the user's device very slow and end up causing timeouts when several users are active and several files are being hashed at the same time.

I do not recommend doing this in real time through the webservice. Instead, consider the following: on the server side, create a file (JSON, XML, TXT... your choice), and whenever an image is updated, add or update its hash in this list. Your app fetches the updated list from time to time, as you mentioned, merges it with its local state, and then knows exactly which files to download because their entries differ from the list. I see this as a way to reduce network traffic and avoid burdening both sides of the application.
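The "hash list" idea above reduces, on the client, to a plain map diff. A minimal sketch in Java (the method and map names are illustrative, not from the answer; the maps would come from parsing the server's manifest and from hashing local files):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ManifestDiff {
    // Given the server's filename -> hash manifest and the device's local
    // filename -> hash map, return the files the device must (re)download:
    // those new on the server or whose hash differs.
    static List<String> filesToDownload(Map<String, String> server,
                                        Map<String, String> device) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : server.entrySet()) {
            String localHash = device.get(e.getKey());
            if (localHash == null || !localHash.equals(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

With this scheme each sync cycle transfers only the manifest (a few bytes per image) plus the images that actually changed, instead of re-hashing or re-sending everything.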

As for comparing the hash algorithms themselves (MD5, SHA1, CRC32, SHA256...), you will only feel a difference with large files or with a large number of files.

I hope I’ve helped.
