An algorithm for finding repeated files?


4

I need to make a program that finds duplicated files on my computer, so that the user can decide what to do with them (e.g., delete the copies). For now I only care about binary comparison (i.e., a file counts as a duplicate only if it is 100% identical to another).

I know that searching only by file name is insufficient, since the same file may have been saved under another name.

Is there any algorithm to compare the files?

I imagine that generating a checksum for every file and then comparing all against all is wasteful, since it is not normal to have that many duplicated files. I also imagine that file size alone is not enough. And a file may be duplicated more than once.

  • Yes, such algorithms exist. Git itself does something like this, comparing files to determine which version is the latest, for example. I believe this link can help you: http://algorithms.openmymind.net/search/binarysearch.html

  • Compare the sizes and, if they are equal, a checksum? Do you want the filesystem search method, the comparison method, or both?

  • @Vitorbraga I updated the question; I'm looking for a comparison method.

  • The size and content have to be the same, but should the timestamps of when the file was created and modified also be part of the duplicate comparison?

  • @Guilhermenascimento Files are identical when they have the same content (i.e., the two files have the same bytes) and thus the same size. So the name, timestamp, access date, and so on don't matter; those are file "metadata".

  • @woliveirajr OK, I just wanted to get a sense of it, since there are several kinds of comparison.

  • @Guilhermenascimento :) Yes, yes. I even thought about adding some complications later (such as detecting that an image is the same even at a different resolution), but I gave up; it's already enough work as it is. Sorry if the previous comment seemed rude!

  • @woliveirajr No worries, it didn't seem rude at all. I found your question interesting.


4 answers

3


Split the work into stages (a sketch of the scan-and-checksum part follows the list):

  1. list everything with basic information: location (disk/directory), name, date and size;
  2. group files that have the same name (exactly the same, including upper and lower case);
  3. likewise, group files that have the same size (in bytes);
  4. discard the "not repeated" files (no matching name or size) from the candidates;
  5. take the "level 1 repeats" (same name, size and date), apply a checksum to each group separately, and mark the ones that are REALLY equal;
  6. take the "level 2 repeats" (same name, or same size and date), apply a checksum to each group separately, and mark the ones that are REALLY equal;
  7. take the "level 3 repeats" (same name and size, different date), apply a checksum to each group separately, and mark the ones that are REALLY equal;
  8. take the "level 4 repeats" (same name or size, different date), apply a checksum to each group separately, and mark the ones that are REALLY equal;
  9. with the REALLY equal ones, present each group to the user so they can decide which files to delete;
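
A minimal PHP sketch of steps 1 to 5, assuming a single starting directory; findDuplicates, $startDir and the other names are illustrative, not part of the answer above:

<?php
// Sketch: scan a directory tree, group files by size, then checksum
// only the groups that contain more than one candidate.
function findDuplicates(string $startDir): array
{
    $bySize = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($startDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $file) {
        if ($file->isFile()) {
            // Steps 1 and 3: record the location, grouped by size in bytes.
            $bySize[$file->getSize()][] = $file->getPathname();
        }
    }

    $duplicates = [];
    foreach ($bySize as $paths) {
        if (count($paths) < 2) {
            continue; // step 4: a unique size cannot be a repeat
        }
        // Step 5 onward: checksum each candidate group separately.
        $byHash = [];
        foreach ($paths as $path) {
            $byHash[hash_file('sha256', $path)][] = $path;
        }
        foreach ($byHash as $group) {
            if (count($group) > 1) {
                $duplicates[] = $group; // REALLY equal: same size and hash
            }
        }
    }
    return $duplicates; // step 9: present these groups to the user
}

Grouping by size first means most files are never hashed at all, which addresses the question's worry about checksumming everything.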

I suggest you add a few options: let the user open the location of each file; open it in the default editor to view the content; and move the "chosen/repeated" files to a specific folder.

An option that I believe is very useful: when a single file is selected (in a Windows environment, for example), the tool could be invoked from the context menu (right mouse click) to look for any REPEATED copies of the selected file.

Just think about ignoring the contents of compressed archives; that is, if the repeated file is inside a ZIP/RAR it will never be examined and therefore will never be considered a repeat (put this in the instructions for your future application). And then send me a copy to test ;-)

1

List all the files;

For each file, perform the steps below (a sketch follows the list):

  1. Generate a hash of the file's contents and store it in a hash table;

  2. On a hash collision, compare the file byte by byte against the files that have the same hash. If they match, you have found a duplicate.
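
A minimal PHP sketch of this approach, assuming the list of files has already been gathered; filesEqual and findDuplicatesByHash are illustrative names, not a standard API:

<?php
// Hash table keyed by content hash; on a collision, fall back to a
// byte-by-byte comparison before declaring a duplicate.
function filesEqual(string $a, string $b): bool
{
    if (filesize($a) !== filesize($b)) {
        return false;
    }
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    $equal = true;
    while (!feof($fa)) {
        if (fread($fa, 8192) !== fread($fb, 8192)) {
            $equal = false; // first differing chunk settles it
            break;
        }
    }
    fclose($fa);
    fclose($fb);
    return $equal;
}

function findDuplicatesByHash(array $files): array
{
    $table = [];      // hash => paths already seen with that hash
    $duplicates = [];
    foreach ($files as $path) {
        $hash = hash_file('sha256', $path);
        foreach ($table[$hash] ?? [] as $candidate) {
            if (filesEqual($candidate, $path)) {
                $duplicates[] = [$candidate, $path];
                break;
            }
        }
        $table[$hash][] = $path;
    }
    return $duplicates;
}

With a strong hash such as SHA-256 the byte-by-byte step is essentially a formality, but it is what makes the 100% identity guarantee unconditional.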

0

You can use basename in PHP:

$inicio = "file:///C://";    // You can change the path to point at other folders.
$arquivo = basename($inicio);
$file = basename($inicio, "Nome");    // basename with a suffix strips "Nome" from the end

// Returns the substring of $inputstr between the delimiters $deliLeft and $deliRight.
function stribet($inputstr, $deliLeft, $deliRight) {
    $posLeft = stripos($inputstr, $deliLeft) + strlen($deliLeft);
    $posRight = stripos($inputstr, $deliRight, $posLeft);
    return substr($inputstr, $posLeft, $posRight - $posLeft);
}

Get the contents:

$res = file_get_contents($inicio);

Locate the name in front of the [1] marker:

$x = stribet($res, $file, '[1]');    // pass the variable itself; in single quotes, '$file' would be the literal string

This should pick up files with [1]:

$d = $x;    // the name fragment that stribet found before the [1] marker

Test whether the file carries the [1] marker:

if ($file == $d) {
    // the selected file matches a name carrying the [1] duplicate marker
}

It may not be exactly what you need; if it doesn't work, just say so. This approach can only catch files with [1] in the name.

0

I don't think that is feasible in general. You would have to compare all the files against each other, and the number of comparisons grows quadratically with the number of files.

You can do something that starts by enumerating the files. Then each file would have to be compared with all the others. An optimization is to compare sizes first, then perhaps a checksum, and only if they still look equal compare byte by byte (see the sketch below).

For a few files this works well, but as the number of files grows the running time quickly rises to impractical scales.
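
A minimal PHP sketch of the quadratic approach described above, with the cheap tests first; pairIsDuplicate and allPairs are illustrative names:

<?php
// Every pair is compared, cheapest test first; the full read happens
// only when size and checksum already agree.
function pairIsDuplicate(string $a, string $b): bool
{
    if (filesize($a) !== filesize($b)) {
        return false;                 // different size: cannot match
    }
    if (md5_file($a) !== md5_file($b)) {
        return false;                 // different checksum: cannot match
    }
    // Final byte-for-byte confirmation (reads both files fully).
    return file_get_contents($a) === file_get_contents($b);
}

function allPairs(array $files): array
{
    $dups = [];
    $n = count($files);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {   // n(n-1)/2 pairs: quadratic
            if (pairIsDuplicate($files[$i], $files[$j])) {
                $dups[] = [$files[$i], $files[$j]];
            }
        }
    }
    return $dups;
}

The hash-table answer above avoids this quadratic pairing entirely, which is why it scales better.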
