Regex to catch files with certain characters in the name

I was put in charge of writing code that walks through the server folders looking for files with invalid names (accents, spaces, punctuation, and special characters). I'm stuck on the regex logic and can't make progress, because every file has an extension, e.g. .txt, .pdf, .doc, among others.

That blessed dot leaves my code in an all-or-nothing situation ("it's either 8 or 80", as the saying goes). In the regex, \W captures all the other characters I want (everything non-alphanumeric), but since the "." is captured along with them, files with correct names, for example a file ending in .txt, are flagged as badly named because of that blessed dot.

Here is the code:

import os
import re

def encontraArquivosEmPastaRecursivamente(pasta):
    arquivosTxt = []
    caminhoAbsoluto = os.path.abspath(pasta)
    for pastaAtual, subPastas, arquivos in os.walk(caminhoAbsoluto):
        # Collect the full path of every file whose name matches the character class.
        # Problem: \W also matches the "." that precedes the extension.
        arquivosTxt.extend([os.path.join(pastaAtual, arquivo)
                            for arquivo in arquivos
                            if re.search(r'[áàâãéèêíïóôõöúçñÁÀÂÃÉÈÍÏÓÔÕÖÚÇÑ\s\W]', arquivo)])

    # Write one path per line; "with" closes the file even if an error occurs.
    with open('lista_de_arquivo.txt', 'w', encoding='utf-8') as arquivo:
        for caminho in arquivosTxt:
            arquivo.write(caminho + '\n')

encontraArquivosEmPastaRecursivamente('c:/Users/paulo/Desktop/Ambiente_de_arquivos')

File names that should be in the file "lista_de_arquivo.txt" after the code runs:

1.txt file (file with spaces);

archive1! @#$% &()_+` {[ª ~}]º,. ;-. txt (file with special characters);

file 1.txt (file with accents);

And what should not be on the list:

file4.txt (file without spaces, accents, or special characters), which still appears on the list because of the "."
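A quick check, as a minimal sketch, shows why the valid name gets flagged: in Python's re module, \W matches any character that is not a letter, digit, or underscore, and the dot is one of those. The sample names below are illustrative assumptions, not the real files from the test folder.

```python
import re

# The part of the question's character class that causes the problem.
# \W matches anything that is not a letter, digit, or underscore.
pattern = re.compile(r'[\s\W]')

# Illustrative names (assumptions, not the actual test files).
names = ['file4.txt', 'file4', 'file 1.txt']

# Map each name to the list of characters the class matched in it.
flagged = {name: pattern.findall(name) for name in names}
for name, hits in flagged.items():
    print(repr(name), hits)
```

The only hit in file4.txt is the dot itself, so any name with an extension is reported. With str patterns in Python 3, \W is Unicode-aware and does not match accented letters, which is why they have to be listed explicitly in the class.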

This is my test environment:

[Image: the folders that the code checks]

Ignore "pasta1" and "pasta2"; they are there only to test that the code recurses into subfolders.

1 answer


Instead of writing a regex that matches the names you want to catch, you can write one that matches the names you don't want (the valid ones) and keep whatever fails to match:

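import os
import re

# Matches only valid names: ASCII letters/digits, optionally followed by one extension
r = re.compile(r'^[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)?$')
arquivosTxt = []
caminhoAbsoluto = os.path.abspath(pasta)  # pasta: the folder to scan, as in the question
for pastaAtual, subPastas, arquivos in os.walk(caminhoAbsoluto):
    # Keep every file whose name does NOT match the valid pattern
    arquivosTxt.extend([arquivo for arquivo in arquivos if not r.match(arquivo)])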

In other words, the files I don't want in the list are those whose names contain only unaccented letters and digits ([a-zA-Z0-9]+), optionally followed by an extension (\.[a-zA-Z0-9]+, a dot followed by one or more letters or digits). The ? right after the group is what makes the extension optional.

The markers ^ and $ indicate the start and end of the string, ensuring that I will check the entire file name.
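To see how the pattern behaves, here is a small sketch that tries it against a few made-up names (the names are assumptions for illustration only):

```python
import re

# Same pattern as in the answer: letters/digits, then at most one extension.
valid_name = re.compile(r'^[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)?$')

# Hypothetical sample names, not taken from the real test folder.
samples = ['file4.txt', 'README', 'file 1.txt',
           'relatório.pdf', 'my_file.txt', 'archive.tar.gz']

# True means the name is considered valid and would be left out of the list.
results = {name: bool(valid_name.match(name)) for name in samples}
for name, ok in results.items():
    print(repr(name), 'valid' if ok else 'flagged')
```

Only file4.txt and README come out as valid. Note that the pattern is strict: it also flags underscores and names with more than one dot, so if those should be accepted the character class or the optional group would need adjusting.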

Then just pick the names that do not satisfy the expression (if not r.match). If a file name contains anything other than unaccented letters and digits (apart from the dot of the extension), it will be included in the list.
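Putting it together, one possible sketch of a complete function is below. The function name find_badly_named_files and the demo file names are hypothetical. It uses fullmatch instead of ^ and $, which is an equivalent way of requiring the whole name to fit, and it keeps the full path as the question's code did. The demo builds a throwaway folder with tempfile so it can run anywhere.

```python
import os
import re
import tempfile

# Valid names: ASCII letters/digits, optionally followed by one extension.
VALID_NAME = re.compile(r'[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)?')

def find_badly_named_files(folder):
    """Return the full paths of files whose names do not fit VALID_NAME."""
    bad = []
    for current, _subfolders, files in os.walk(os.path.abspath(folder)):
        for name in files:
            # fullmatch succeeds only if the entire name fits the pattern.
            if not VALID_NAME.fullmatch(name):
                bad.append(os.path.join(current, name))
    return bad

# Demo on a temporary folder with one subfolder, to exercise the recursion.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'pasta1'))
    demo_files = ['file4.txt', 'file 1.txt', os.path.join('pasta1', 'file#2.txt')]
    for relative in demo_files:
        open(os.path.join(root, relative), 'w').close()
    found = sorted(os.path.basename(p) for p in find_badly_named_files(root))

print(found)
```

The returned list can then be written to lista_de_arquivo.txt one path per line, as in the question.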

  • This was exactly the logic I couldn't come up with. I'm very new to regex. Thank you very much for your help!
