Delete duplicate files

I am writing a script that scans my system and deletes all duplicate files. To avoid problems while testing, I replaced the delete step with a copy and I am running it on one specific folder to make sure everything works. However, it raises an error that I do not know how to solve, and it always complains about the last file, whichever file that happens to be.

This is what appears:

Traceback (most recent call last):
  File "…\Users\Unknown\AppData\Local\Programs\Python\Python37-32\lib\filecmp.py", line 51, in cmp
    s1 = _sig(os.stat(f1))
FileNotFoundError: [WinError 2] O sistema não pode encontrar o arquivo especificado: '82110225_2471280396417255_4533531125607301120_n - Copia.jpg'

This is the code:

import os, shutil, filecmp,itertools

files = os.listdir('D:\\Scripts_Python\\Nova pasta\\')
extension =('.jpg')
for filename in files:
    if filename.endswith(extension):
        for f1, f2 in itertools.combinations(files,2):
            comp = filecmp.cmp(f1, f2,  shallow=False)
            if comp == True:
                shutil.copy(f2,'D:\\Scripts_Python\\Nova pasta\\Nova pasta\\')
                break

1 answer

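Everything indicates that you are not using the full path of the files you want to copy or delete: os.listdir() returns bare file names, so they are looked up in the current working directory. Joining each name with the folder via os.path.join(), optionally wrapped in os.path.abspath(), builds the full path.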
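
A minimal sketch of that fix, keeping the filecmp approach from the question. The function name `find_duplicate_pairs` and its parameters are placeholders introduced here, not part of the original script; the sketch also drops the outer loop over file names, since itertools.combinations already visits every pair once:

```python
import filecmp
import itertools
import os


def find_duplicate_pairs(folder, extension=".jpg"):
    """Return (path1, path2) pairs of files in `folder` with identical content."""
    # os.listdir() yields bare names, so join each one with the folder
    # and keep only regular files with the wanted extension.
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(extension)
        and os.path.isfile(os.path.join(folder, name))
    )
    # Compare every pair once, byte by byte (shallow=False).
    return [
        (f1, f2)
        for f1, f2 in itertools.combinations(paths, 2)
        if filecmp.cmp(f1, f2, shallow=False)
    ]
```

While testing, the second element of each returned pair could be passed to shutil.copy() in place of the bare name used in the question.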

Alternatively, you can use the hashlib library to compute a signature for each file and use it to identify the duplicates.

The computed signature can serve as the key of a dictionary whose values are the lists of files sharing that signature, for example:

import os, hashlib

def duplicados( path, extension ):
    ret = {}

    # For each entry in the directory
    for filename in os.listdir(path):

        # Only files with the desired extension
        if filename.endswith(extension):

            # Build the full path of the file
            fullpath = os.path.abspath(os.path.join(path, filename))

            # Compute the MD5 hash (signature) of the file
            with open(fullpath,'rb') as f:
                md5sum = hashlib.md5(f.read()).hexdigest()

            # Add the file to a dictionary of lists
            # whose key is the signature of the file
            if md5sum not in ret:
                ret[md5sum] = []
            ret[md5sum].append(fullpath)

    # Filter and return only the duplicated files
    return { k:v for k, v in ret.items() if len(v) > 1 }

print(duplicados(path='D:\\Scripts_Python\\Nova pasta\\', extension='.jpg'))
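
To go from this dictionary to the actual cleanup, one possible convention is to keep the first path of each group and treat the rest as removable. The sketch below assumes that convention; `remove_duplicates` and its `dry_run` flag are hypothetical names, and it expects a mapping shaped like the one returned by duplicados():

```python
import os


def remove_duplicates(groups, dry_run=True):
    """Delete all but the first file of each duplicate group.

    `groups` maps a signature to a list of paths with that signature.
    With dry_run=True nothing is deleted; the paths are only returned.
    """
    removed = []
    for paths in groups.values():
        # Sort so the kept file is deterministic (alphabetically first).
        keep, *extras = sorted(paths)
        for path in extras:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Leaving dry_run=True during the test phase plays the same role as the copy-instead-of-delete trick in the question: the list can be inspected before anything is removed.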

EDIT:

As suggested in the comments by @jsbueno, here is a solution using pathlib and filecmp (plus the third-party networkx package to group the matching pairs) that solves the problem more efficiently:

from pathlib import Path
from filecmp import cmp as compare
from itertools import combinations
from networkx import Graph, connected_components

def duplicados( path, extension ):
    # Get the list of files in the directory,
    # filtered by the glob pattern (e.g. '*.jpg')
    files = [str(p) for p in Path(path).resolve().glob(extension)]

    # Get the list of duplicated pairs; shallow=False forces a
    # comparison of the contents, not just of the os.stat() data
    dups = [(f1, f2) for f1, f2 in combinations(files,2) if compare(f1, f2, shallow=False)]

    # Build a graph from the
    # pairs of duplicated files
    grafo = Graph()
    grafo.add_edges_from(dups)

    # Return the list of connected components
    # of the graph (sets of identical files)
    return list(connected_components(grafo))

print(duplicados(path='D:\\Scripts_Python\\Nova pasta\\', extension='*.jpg'))
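
One detail to keep in mind when relying on filecmp.cmp: with the default shallow=True, two files whose os.stat() signatures (file type, size and modification time) match are reported as equal without their contents being read. The sketch below illustrates this with two throwaway files in a temporary directory; the helper name `shallow_vs_deep` is only illustrative:

```python
import filecmp
import os
import tempfile


def shallow_vs_deep():
    """Compare two same-sized, same-mtime files with different contents."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.jpg")
        b = os.path.join(tmp, "b.jpg")
        with open(a, "wb") as f:
            f.write(b"AAAA")
        with open(b, "wb") as f:
            f.write(b"BBBB")
        # Force identical timestamps so both stat signatures match.
        os.utime(a, ns=(10**18, 10**18))
        os.utime(b, ns=(10**18, 10**18))
        shallow = filecmp.cmp(a, b)              # stat signatures only
        deep = filecmp.cmp(a, b, shallow=False)  # reads the contents
    return shallow, deep
```

For a script that ends up deleting files, passing shallow=False, as the code in the question does, is the safer choice.
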
  • I was also going to suggest the hashlib approach, so that in a folder with 30 images the program would not end up reading 900 files, but Python's filecmp is smart and already takes care of that. From its help: "Return value: True if the files are the same, False otherwise. This function uses a cache for past comparisons and the results, with cache entries invalidated if their stat information changes."

  • Also, it is not a good idea to recommend os.path.abspath at this point in the game. Code like this should use pathlib as much as possible: with the methods of pathlib.Path it becomes much more practical, and everything lives in one place. Listing the directory, checking the extension, even reading a file and calling stat are all methods and properties of pathlib.Path objects.

  • @jsbueno: thanks for the comments! I have edited the answer accordingly.
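
The pathlib-only style suggested in the comments could look roughly like the sketch below. It groups files by size first (read from Path.stat()) and hashes only the candidates that share a size; the function name `duplicates_by_hash` and the choice of SHA-256 are assumptions made here, not something taken from the thread:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicates_by_hash(folder, pattern="*.jpg"):
    """Group files in `folder` matching `pattern` by content hash."""
    # Files of different sizes cannot be identical, so bucket by size first.
    by_size = defaultdict(list)
    for path in Path(folder).glob(pattern):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size means no duplicate is possible
        for path in paths:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)

    # Keep only the groups that contain more than one file.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Here the directory listing, the extension filter, the size lookup and the file read are all methods of Path objects, so no string path handling is needed.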
