Problem accessing a file with an accented name (combining character)

13

I’m trying to list a folder (files, subfolders) in Python [2.7 on Windows XP], and I’m having problems with files that have accented names. I know that os.listdir behaves differently depending on whether its argument is a byte string or a Unicode string. My problem is that I have file names encoded in different ways:

>>> import os
>>> os.listdir('teste')
['a\xb4rvore.jpg']
>>> os.listdir(u'teste')
[u'a\u0301rvore.jpg']
>>> os.listdir('teste2')
['\xe1rvore.txt']
>>> os.listdir(u'teste2')
[u'\xe1rvore.txt']

In Windows Explorer, both files look normal: árvore.jpg and árvore.txt. But while the second is listed normally, the first gives an error message no matter how I access it:

def imprimir(pasta):
    print pasta
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub
        else:
            imprimir(sub)

>>> imprimir('teste2')
teste2
teste2\ßrvore.txt
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt

>>> imprimir('teste')
teste
teste\a┤rvore.jpg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 11, in imprimir
    imprimir(sub)
  File "teste.py", line 6, in imprimir
    for x in os.listdir(pasta):
WindowsError: [Error 3] O sistema nÒo pode encontrar o caminho especificado: 'teste\\a\xb4rvore.jpg/*.*'
>>> imprimir(u'teste')
teste
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 9, in imprimir
    print sub
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0301' in position 7: character maps to <undefined>

How do I access this other file? I don’t think its name is corrupted, since a\u0301 is a valid way to represent á. However, I don’t know how to access it, and I have a volume with several files in this format (I can avoid producing similar files in the future, but I still need to process the existing ones), so converting them by hand is impractical.
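
To illustrate why both names look the same in Explorer, here is a minimal sketch using the standard unicodedata module. The variable names are only illustrative; the two strings are the ones returned by os.listdir above, with the extension made equal so they can be compared.

```python
from __future__ import print_function
import unicodedata

decomposed = u'a\u0301rvore.jpg'   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = u'\xe1rvore.jpg'     # single code point U+00E1

# The two strings differ when compared code point by code point...
same_raw = (decomposed == precomposed)

# ...but they are canonically equivalent: they match once both are normalized.
same_nfc = (unicodedata.normalize('NFC', decomposed) ==
            unicodedata.normalize('NFC', precomposed))

print(same_raw, same_nfc, len(decomposed), len(precomposed))
```

The decomposed form is one code point longer, which is why the two directory listings differ even though the names render identically.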

  • Apparently this is a bug in version 2.7 of the language, since the same code works perfectly with version 3.3.2 (which I have installed), changing only the use of print.

3 answers

6

To me, this looks like just an encoding problem when printing the file/folder names (the print calls).

Try encoding the names as UTF-8 by changing your function as follows (note the .encode('utf8') added at the end of the lines that call print):

def imprimir(pasta):
    print pasta.encode('utf8')
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub.encode('utf8')
        else:
            imprimir(sub)

EDIT: After rereading your question, I think I see another possible source of confusion. You are using IDLE to test interactively, but IDLE uses a different encoding (in my test I had created a .py file encoded in UTF-8, so I did not hit the same problem). To check IDLE’s encoding, do the following:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

So, to display the filenames correctly, you should use this same encoding (or change the default encoding in IDLE - which I honestly don’t know how to do). I did the test here, and with the cp1252 encoding the names are correctly displayed:

>>> def imprimir(pasta):
    print pasta.encode('cp1252')
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub.encode('cp1252')
        else:
            imprimir(sub)

>>> imprimir(u'teste')
teste
teste\árvore.jpg
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt
>>> 
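
A more defensive variant of the same idea, as a sketch only: instead of hard-coding the encoding, take it from sys.stdout.encoding and replace any character the console cannot represent. The helper name to_console_bytes and the UTF-8 fallback are assumptions made for this illustration, not something from the question.

```python
from __future__ import print_function
import sys

def to_console_bytes(text, encoding=None):
    """Encode text for the console, replacing characters it cannot represent."""
    # Assumption: fall back to UTF-8 when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')

# cp850 has no combining acute accent (U+0301), so it is replaced by '?'.
print(repr(to_console_bytes(u'teste\\a\u0301rvore.jpg', 'cp850')))

# cp850 does have the precomposed U+00E1, so this name encodes without loss.
print(repr(to_console_bytes(u'teste2\\\xe1rvore.txt', 'cp850')))
```

In Python 2.7 the function body would then use print to_console_bytes(sub). This avoids the UnicodeEncodeError, but a decomposed name is still shown with a '?' in place of the accent.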
  • I tested it and it returns the following error, where path is a variable holding the directory: Traceback (most recent call last): File "<pyshell#49>", line 1, in <module> imprimir(path) File "<pyshell#48>", line 8, in imprimir imprimir(sub) File "<pyshell#48>", line 2, in imprimir print pasta.encode('utf8') UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 52: ordinal not in range(128)

  • Is your path variable a Unicode string? For example, instead of path = 'test', did you write path = u'test'?

  • It worked now but the file a┤rvore.txt turned a┤rvore.txt.

  • What do you mean by "turned"? Do you mean the file’s content was changed, or that the console output showed something other than 'árvore.txt'?

  • @Zignd: I think I understand your question. I edited my answer to try to include this possibility.

  • Thanks for the answer. It didn’t work for me, but it gave me the tip I needed to find a solution. I posted it as a separate answer.


5


Based on the answer by @Luiz Vieira, and on a related question on the English-language SO, I was able to find a solution. The problem was not in accessing the file itself, but only in printing its name on the screen. The code below, for example, works normally:

    if os.path.isfile(sub):
        with open(sub, 'rb') as f:
            with open(sub + u'.saida', 'wb') as s:
                s.write(f.read()) # Creates an exact copy of the original file
    ...
imprimir(u'teste') # Careful: only the unicode version works; the other gives the same error

However, my IDLE is using the Cp850 encoding, which apparently cannot properly print combining characters. The solution, therefore, is to normalize the file name so that the character pair is represented by a single precomposed character (\u00e1):

import os
import unicodedata

def imprimir(pasta):
    # NFC merges 'a' + U+0301 into the single code point U+00E1
    print unicodedata.normalize('NFC', pasta)
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print unicodedata.normalize('NFC', sub)
        else:
            imprimir(sub)

>>> imprimir(u'teste')
teste
teste\árvore.jpg
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt
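
Normalizing only changes what is printed; the files on disk keep their decomposed names. If renaming them is acceptable, the same normalization could be used to work out the new names in bulk. The sketch below only computes the (old, new) pairs and does not touch the disk; plan_nfc_renames is a hypothetical helper written for this illustration.

```python
import unicodedata

def plan_nfc_renames(names):
    """Return (old, new) pairs for names whose NFC form differs from the original."""
    pairs = []
    for name in names:
        composed = unicodedata.normalize('NFC', name)
        if composed != name:
            pairs.append((name, composed))
    return pairs

# Sample listing: one decomposed name, one precomposed name, one plain ASCII name.
listing = [u'a\u0301rvore.jpg', u'\xe1rvore.txt', u'plain.txt']
renames = plan_nfc_renames(listing)

# Only the decomposed name needs renaming, and its NFC form can be encoded
# in cp850, which the decomposed form could not.
composed_bytes = renames[0][1].encode('cp850')
```

Each pair could then be passed to os.rename with the folder joined onto both names, after checking that the target does not already exist. I have not tried this on the original volume, so treat it as a starting point.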

0

I had a similar problem with accents in Python. Maybe this solves it:

import sys
sys.setdefaultencoding('utf-8') # or Latin1 or cp1252

  • Thanks for the suggestion, but it didn’t work for me. According to the documentation, this function is not normally available to the programmer: "This function is only intended to be used by the site module implementation and, where needed, by sitecustomize. Once used by the site module, it is removed from the sys module’s namespace."
