How to open a Unicode file inside a zip?

7

I tried the following:

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
            line = readFile.readline()
            print(line)
            split = line.split('\t')

but the result is:

b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last)
File "zip.py", line 6
    split = line.split('\t')
TypeError: Type str doesn't support the buffer API

How do I open this file as Unicode instead of binary?

  • I also asked the question in English: http://stackoverflow.com/questions/20601796/how-to-open-an-unicode-text-file-inside-a-zip

2 answers

7


If you know the file's encoding, just call the decode method on the file contents (a str in Python 2, bytes or bytearray in Python 3):

with zfile.open(name, 'r') as readFile:  # 'rU' is rejected by ZipFile.open since Python 3.6
    conteudo = readFile.read().decode(codificacao)

As mentioned in an answer to the same question on the English SO, trying to split the content into lines before decoding is problematic, since different encodings represent line breaks differently. However, once you have read and decoded the entire file (via read), you can split it into lines as usual, because it is then a Unicode string (unicode in Python 2, str in Python 3):

line = conteudo.split('\n')[0]

Or by means of a regular expression (to support \n, \r or \r\n):

import re
line = re.split(r'\r?\n|\r', conteudo)[0]
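
Putting these pieces together, a minimal self-contained sketch for Python 3 might look like the following. The in-memory archive, the member name `5.csv`, the UTF-8 encoding and the helper name `first_line_fields` are assumptions made for illustration only; substitute your real archive and its actual encoding.

```python
import io
import re
import zipfile


def first_line_fields(zip_source, codificacao="utf-8"):
    """Hypothetical helper: map each member name to the tab-separated
    fields of its first line."""
    resultado = {}
    with zipfile.ZipFile(zip_source, "r") as zfile:
        for name in zfile.namelist():
            with zfile.open(name, "r") as readFile:
                # read() returns bytes; decode everything before splitting.
                conteudo = readFile.read().decode(codificacao)
            # Split on \n, \r or \r\n only after decoding.
            line = re.split(r"\r?\n|\r", conteudo)[0]
            resultado[name] = line.split("\t")
    return resultado


# Build a small archive in memory so the sketch runs without files on disk.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("5.csv", "$0.0\t1822\t1\t1\t1\r\nsecond line\n".encode("utf-8"))
archive_bytes.seek(0)

print(first_line_fields(archive_bytes))
```

Reading the whole member into memory is fine for small files; for very large members a streaming approach would be preferable.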

3

The answer from the gringos on the English SO was:

You are seeing this error because you are mixing bytes with Unicode. The argument to split must also be a byte string:

>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']

To get a Unicode string, use decode:

>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'

If you are iterating over the file, you can use codecs.iterdecode, but that will not work with readline():

import codecs

with zfile.open(name, 'r') as readFile:  # 'rU' is rejected since Python 3.6
    for line in codecs.iterdecode(readFile, 'utf8'):
        print(line)
        # etc
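
If you do need readline(), one alternative that is not part of the quoted answer is to wrap the binary member in io.TextIOWrapper from the standard library, which decodes on the fly. The sketch below assumes Python 3 and UTF-8 content, and it builds a throwaway in-memory archive with a made-up member name `5.csv` so that it runs on its own.

```python
import io
import zipfile

# Build a small archive in memory (illustrative data, not a real file).
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("5.csv", "$0.0\t1822\t1\t1\t1\n".encode("utf-8"))
archive_bytes.seek(0)

first_lines = []
with zipfile.ZipFile(archive_bytes, "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, "r") as readFile:
            # Wrap the binary stream so readline() returns str, not bytes.
            texto = io.TextIOWrapper(readFile, encoding="utf-8")
            line = texto.readline()
            # The separator is now a str, matching the decoded line.
            first_lines.append(line.rstrip("\n").split("\t"))

print(first_lines)
```

With the default newline handling, TextIOWrapper also translates \r and \r\n to \n, which covers what the old 'rU' mode was meant to do.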
