Problem converting ANSI to UTF-8 in Python 3


I have some files in ANSI encoding, containing accents, "ç", and so on. I need to convert these various files to UTF-8 encoding. Some of the files end up correctly in UTF-8 and others don't. Why?

The code I use is as follows:

import codecs
import glob

def encode_files(self):
    path_list_files = glob.glob(self.config.path_prepared_scd + r"\*.txt")

    for path_files in path_list_files:
        file = path_files.split("\\")[-1].split(".")[0]
        output_temp = self.config.path_temp_scd + "\\" + file + ".tmp"

        with codecs.open(path_files, "r", encoding="ANSI", errors="ignore") as text_reading:
            content = text_reading.read()
        with codecs.open(output_temp, "w", encoding="UTF-8", errors="ignore") as text_writing:
            text_writing.write(content)

Why do some files end up correctly in UTF-8 while others do not?

1 answer

What exactly is "not staying right"? If you don't show what comes out wrong, there is no way to tell for sure.

In fact, "ANSI" is not even a real encoding name - in Python it is only valid on Windows, raises an error on other systems, and simply refers to whatever the default encoding of the Windows installation the program is running on happens to be. Take the same program and the same data file, run it on a PC in Ukraine, and you get different results! Since you know which encoding the files are in (assuming they display correctly on a Windows machine), use a fixed encoding name: "latin-1". (That way, if you take your data and the same program to another PC - one in the cloud, for example - it still runs correctly.)
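To see why a system-dependent codec is a problem, here is a small sketch (the byte value is just illustrative): the very same byte decodes to different characters under different Windows "ANSI" code pages, whereas a fixed name like "latin-1" always means the same thing everywhere:

```python
# The single byte 0xE7 is "ç" in latin-1 (Western Europe),
# but a Cyrillic letter in cp1251, the "ANSI" code page on
# Windows systems configured for Ukrainian or Russian.
raw = b"\xe7"

print(raw.decode("latin-1"))  # ç
print(raw.decode("cp1251"))   # a Cyrillic letter, not ç
```

So a file written on one machine and read back with encoding="ANSI" on another can silently produce different text.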

Anyway, passing "errors='ignore'" there is bad practice - and here it accomplishes nothing: for the "latin-1" encoding every byte is valid, so a decoding error can never happen, and, on the other side, every Unicode character can be encoded in UTF-8, so you will never get an error there either. But if there were an error, all that "ignore" would achieve is that data would be silently omitted from the output, and you would never know about it - the offending characters would simply be suppressed, without any notification from the program. See this example forcing accented characters into an encoding that does not accept them:

In [4]: "maçã".encode("ASCII", errors="ignore")                                                                                    
Out[4]: b'ma'

Finally - and I can't be sure, since you don't say what exactly goes wrong - what I imagine is that some of the files you are trying to convert to UTF-8 are already in UTF-8. In that case you get a double encoding: accented characters show up as two characters each (with a sample of the broken output it would be possible to confirm this). Either that, or the input files are simply in a different encoding than expected.
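The double-encoding effect is easy to reproduce: reading UTF-8 bytes as if they were latin-1 and then writing them out as UTF-8 turns each accented character into two. A minimal sketch:

```python
original = "maçã"

# Pretend a file that was already UTF-8 was read with encoding="latin-1":
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # maÃ§Ã£  <- each accented character became two

# As long as nothing was dropped, the damage is reversible:
restored = garbled.encode("latin-1").decode("utf-8")
print(restored)  # maçã
```

If your bad files show this "Ã§"-style pattern, double encoding is almost certainly what happened.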

In case some files are already in UTF-8, a simple way to avoid double encoding is to try reading the file as UTF-8 first, and only fall back to latin-1 if that raises an error -

You can do:

from pathlib import Path

def encode_files(self):
    folder = Path(self.config.path_prepared_scd)
    temp_folder = Path(self.config.path_temp_scd)

    for path_files in folder.glob("*.txt"):
        output = temp_folder / (path_files.stem + ".tmp")
        try:
            data = path_files.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            data = path_files.read_text(encoding="latin-1")

        output.write_text(data, encoding="utf-8")

Using pathlib.Path also makes handling file names and directories much easier - besides providing the read_text and write_text methods which, for cases where the file is processed all at once, are simpler than combining "open", "with", etc.
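For instance, a couple of the conveniences used above (the path names here are just illustrative):

```python
from pathlib import Path

# "/" joins path components; .stem gives the file name without
# the extension, and .with_suffix swaps the extension.
p = Path("C:/data/prepared") / "report.txt"
print(p.stem)                      # report
print(p.with_suffix(".tmp").name)  # report.tmp
```

Compare that with the split("\\")[-1].split(".")[0] chain in the original code, which breaks on names containing extra dots.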

  • Thank you very much. It worked perfectly and the code got much cleaner. I didn’t know this procedure. Thanks again @jsbueno
