Accent error while saving Python file


I’m not able to save a file with accented characters in Python; I’ve come to ask for your help:

import csv

f = open('output.txt', 'w')

data = []

def parse(filename):
    with open(filename, 'r') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)
        for line in reader:
            f.write("%s\n" % line)

parse('Soap.csv')
f.close()

I always get strings like:

Nota\xe7\xf5es

and I would like the output to be:

Notações

  • What happens if you do f.write("%s\n" % unicode(line, "utf-8"))?

  • coercing to Unicode: need string or buffer, list found

  • each line I’m saving is a list

  • Then convert to string first and apply unicode: unicode(str(line), "utf-8") and see if that solves the problem.

  • @Andersoncarloswoss, no error, but the output was the same

1 answer

So - the biggest problem here is that you have a list object at hand, returned by iterating the reader - and you are trying to write that list directly to an output text file, converting it to a string only with the % operator on the line f.write("%s\n" % line).

This list-to-string conversion (even if you were using the string method .format instead of %) uses the internal representation (repr) of each element in the list - not its representation with str. If this were Python 3 your code would have worked, because the internal representation of simple accented characters displays them as-is, instead of encoding them as escapes ("\xhh" for Python 2 byte-strings, "\uhhhh" for Python 3 text strings).
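The repr-versus-str difference when formatting can be seen directly (shown here in Python 3, where both forms display the accents; in Python 2 the first print would show the \xhh escapes instead — the field values are hypothetical):

```python
line = ['maçã', 'pêra']

# formatting the whole list goes through repr() of each element,
# so each string is shown quoted:
print("%s" % line)

# formatting one string by itself goes through str():
print("%s" % line[0])
```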

However, the correct thing is to write each string in the list separately - ensuring that Python internally uses the representation given by str. Adapting your code, it can look like this - assuming you want the output file to read exactly like what your code attempts: each row as a list of strings in Python syntax:

import csv

def parse(filename):
    with open(filename, 'r') as csvfile, open('output.txt', 'w') as f:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)
        for line in reader:
            f.write("[%s]\n" % ", ".join("'%s'" % field for field in line))

parse('Soap.csv')
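For a single row, the join expression produces one line of Python list syntax (the field values here are hypothetical):

```python
line = ['Notações', 'texto livre']   # a hypothetical decoded row

# quote each field individually, join with ", ", wrap in brackets
row = "[%s]" % ", ".join("'%s'" % field for field in line)
print(row)
```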

Note another glaring point in your code: you open a file for writing in the body of the module, and close it in the body of the module, without any error handling, and with the function using the open file as a global variable.

If the file is used in more than one function, or in more than one call to the same function: (1) create another function to encapsulate all the calls that write to the file; (2) preferably use the "with" statement to open (and automatically close) the output file; (3) pass the opened file explicitly as a parameter to all functions that will use it.
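A minimal Python 3 sketch of recommendations (2) and (3) - the main function and the simplified reader call are illustrative, not the original code:

```python
import csv

def parse(filename, f):
    # the output file arrives as an explicit parameter, not a global
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=';')
        for line in reader:
            f.write("%s\n" % line)

def main():
    # "with" guarantees the file is closed even if parse() raises
    with open('output.txt', 'w') as f:
        parse('Soap.csv', f)
```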

Now, as I mentioned before, this code is in Python 2, and it works almost by chance, because you are handling text data - both from your input file and when writing the output - without decoding the data you read or encoding the data you write to a specific encoding. And it is this kind of thing that makes Python 2 so difficult - people assume it is "right", but "\xe9" can be an "é" if the encoding is latin1, or an entirely different character if the encoding is for Greek, Cyrillic, Hebrew or another alphabet.
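The same byte illustrates this (Python 3 syntax; the code pages chosen are just examples):

```python
raw = b'\xe9'
print(raw.decode('latin1'))   # "é" in Western European encodings
print(raw.decode('cp1251'))   # a Cyrillic letter
print(raw.decode('cp1253'))   # a Greek letter
```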

In Python 2 the csv module is very limited for dealing with real text - it is up to you to manually decode each element after reading. In Python 3 it decodes the text automatically.

So, assuming you are reading a CSV file in Latin 1 and want its output in utf-8, for example, you can do so:

import csv

INPUT_CODEC = "latin1"
OUTPUT_CODEC = "utf-8"

def parse(filename):
    with open(filename, 'r') as csvfile, open('output.txt', 'w') as f:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)
        for line in reader:
            line = [field.decode(INPUT_CODEC) for field in line]
            f.write("[%s]\n" % ", ".join("'%s'" % field.encode(OUTPUT_CODEC) for field in line))

parse('Soap.csv')

In Python 3, you pass the encodings when opening the files, and Python does the decoding and encoding for you. If you don't pass them, it tries to use appropriate defaults from the operating system context:

with open(filename, 'r', encoding="latin1") as csvfile, open('output.txt', 'w', encoding="utf-8") as f:
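A full Python 3 version of the function could then look like this (a sketch under the same latin1-in, utf-8-out assumption; newline='' is what the csv module documentation recommends for files passed to the reader):

```python
import csv

def parse(filename):
    # in Python 3 the encodings go to open(); the csv reader then
    # yields already-decoded text strings
    with open(filename, 'r', encoding='latin1', newline='') as csvfile, \
            open('output.txt', 'w', encoding='utf-8') as f:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect=dialect)
        for line in reader:
            f.write("[%s]\n" % ", ".join("'%s'" % field for field in line))
```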

There is still another issue - if your strings contain line breaks (and possibly some other special characters), those line breaks ("\n") will go straight into your output file, making it hard to read - and syntactically invalid as "a Python list of strings". That is, if your CSV has something like palavra 1; "batatinha quando nasce\nesparrama pelo chão"; palavra 3, the "enter" inside the second column will be read correctly by the CSV reader (because of the quotes) - and will be written as-is to your output file.

To avoid this, you can escape the line break and a few other special characters in the output file: that is, convert characters that compromise the file structure into replacement sequences, which cause no problems in the file and are converted back when reading - one way to read your output file back is to run an eval on each line, for example. A safe way is to use the urllib.quote method to record each string and urllib.unquote to read it back - but this requires an extra conversion step when reading, and generates a file that is hard to read and edit by hand. Another way is to simply replace each real "\" with two ("\\"), and then each "\n" (a single character, with decimal code 10) with "\\n" (two characters, a "\" and an "n") - that way, when Python does an eval, it reads the sequence "\\" as a single "\" and interprets the "\n" in the text file as a single "new line" character.
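A sketch of that second approach (the escape helper is a hypothetical name; eval-based reading assumes you trust the file, and this version still does not handle quote characters inside a field):

```python
def escape(s):
    # order matters: double the real backslashes first, then replace
    # the newline character (code 10) with the two characters "\" "n"
    return s.replace('\\', '\\\\').replace('\n', '\\n')

field = 'batatinha quando nasce\nesparrama pelo chão'
line = "'%s'" % escape(field)   # now safe to write on a single line
restored = eval(line)           # eval turns "\n" back into a newline
```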
