Decoding a file in Python


4

I have a file whose contents all look like this, after my crawler (also written in Python) saved the data to it:

b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'

Is there any way to get the encoding out of this file and convert it to Unicode as quickly as possible? Ideally without installing any program, so as not to affect the performance of my crawler or the running of this service.

I've tried using bytes.find and then bytes.decode, but as expected the data goes back to its initial state, and I also realized that strings have no decode method.

  • Your file contains special characters written as hexadecimal escapes (\xc3\xa3o), and you want to write them out normally, is that it?

  • Yes @Brumazzidb, I need to take these characters and replace them with their Unicode versions. But my file has more than 140 thousand lines, so I need a solution in Python!

  • There is no way to decode the contents of this file. It seems it was written out carelessly: all the content is formatted as a Python string literal, so you will have to read the file and use replace to substitute the hexadecimal sequences and the quotes. - Brumazzi DB

2 answers

2


The "b'" prefix in the representation of your object shows that what you have at that point in your program is a bytes object, not a text string.

In Python 3 the two are different: ever since multi-byte text encodings were invented, one can no longer say that a byte is a character.

The normal workflow in any Python application is:

  1. get your input data;
  2. if the library that delivered the data did not already deliver it as text, that is, if you received bytes, decode them (decode) so that they become text;
  3. process your data;
  4. encode it again (encode) and write it on the way out (if that is not done automatically, as it is with text files, for example).
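The four steps above can be sketched roughly as follows. This is only an illustration, assuming the input arrives as UTF-8 bytes; the in-memory buffers stand in for a real file or socket, and the names source, raw, text, processed, and sink are made up for the example.

```python
import io

# 1. Get the input data (an in-memory stream stands in for a file or socket).
source = io.BytesIO(b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!')
raw = source.read()

# 2. The data arrived as bytes, so decode it to text (assumed to be UTF-8).
text = raw.decode("utf-8")

# 3. Process the data as text; upper() is just a placeholder operation.
processed = text.upper()

# 4. Encode again on the way out, since a binary sink only accepts bytes.
sink = io.BytesIO()
sink.write(processed.encode("utf-8"))
output = sink.getvalue()
```

With a file opened in text mode, for example open(path, encoding="utf-8"), steps 2 and 4 are done by the file object itself.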

So in your case, assuming the object you showed is in the variable a, just decode these bytes to text (an object of type str in Python 3) and carry on with your program:

a = b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'
b = a.decode("utf-8")
print(b)

In this case, I can tell that the encoding is utf-8 just by looking at the bytes: two bytes for each accented character, the first of them being "\xc3", is a good hint that the bytes represent text encoded in utf-8.
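That hint can be checked in the interpreter. The sketch below encodes one accented character and then decodes the same bytes with a different codec (latin-1, chosen here only for contrast) to show what a wrong guess looks like; the variable names are illustrative.

```python
# One accented character becomes two bytes in UTF-8, the first being 0xC3.
encoded = "ã".encode("utf-8")
print(encoded)       # b'\xc3\xa3'
print(len(encoded))  # 2

# Decoding with the right codec restores the original text.
right = b'N\xc3\xa3o'.decode("utf-8")    # 'Não'

# Decoding with the wrong codec does not fail, but each byte becomes
# its own character, producing the familiar garbled result.
wrong = b'N\xc3\xa3o'.decode("latin-1")  # 'NÃ£o'
```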

One essential thing to understand is the difference between text (str in Python 3), which is composed of Unicode characters, and bytes, which are sequences of numbers between 0 and 255 and are what is actually stored in files or transferred over the network. On this subject, be sure to read:

http://local.joelonsoftware.com/wiki/O_M%C3%ADnimo_Absoluto_Que_Todos_os_Programadores_de_Software_Precisam,_Absolutamente,_Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres_(Sem_Desculpas!)
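If the file on disk contains the literal characters b'...' on each line, as the sample in the question suggests, the same decode step still applies once each line has been turned back into a bytes object. The following is a sketch under that assumption (every non-empty line is one complete Python bytes literal, encoded as UTF-8); decode_literal_lines is a hypothetical helper, not something from the original post, and it uses only the standard library.

```python
import ast
import os
import tempfile


def decode_literal_lines(path, encoding="utf-8"):
    """Yield text for each non-empty line holding a literal such as b'...'."""
    # The escaped form is plain ASCII, so the file can be read as ASCII text.
    with open(path, "r", encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # literal_eval only parses literals; it does not run arbitrary code.
            value = ast.literal_eval(line)
            if isinstance(value, bytes):
                yield value.decode(encoding)
            else:
                yield str(value)


# Demo: write one line in the same shape as the question's sample, then read it back.
sample = b'N\xc3\xa3o n\xc3\xa3o n\xc3\xa3o, n\xc3\xb3s iremos sim!'
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="ascii") as tmp:
    tmp.write(repr(sample) + "\n")
    demo_path = tmp.name

decoded = list(decode_literal_lines(demo_path))
os.remove(demo_path)
```

Lines that are not valid literals make ast.literal_eval raise ValueError or SyntaxError, so a real file may need a try/except around that call. Because the helper is a generator, it processes one line at a time regardless of how many lines the file has.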

0

One bad thing about Python is working with strings; I've had a lot of headaches with Unicode and utf-8.

Getting to the point: in interpreted languages, header comments are often used for "settings" (they do not configure the system).

In Python, the first line is normally reserved for the path to the interpreter (the shebang line).

#!/usr/bin/python

or

#!/usr/bin/env python

And the second line usually holds the encoding information:

#*-* coding: utf-8 *-*
#*-* coding: latin-1 *-*

Use only one of them!

The *-* markers shouldn't be necessary, but I've never used the line without them, so I can't say whether it makes a difference.

With this line, you can use special characters in your code.

print "maçã é maçã!"

If you want to use Unicode, you can use the unicode function to convert your bytes.

uni_str = unicode("maçã", "utf-8")

In any case, the encoding line is mandatory in every file!
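The print statement and the unicode function above are Python 2 only. In Python 3 a rough counterpart of that conversion is the str constructor with an encoding argument, or equivalently bytes.decode. A small sketch, with illustrative variable names:

```python
# Bytes as they would arrive from a file or socket.
raw = "maçã".encode("utf-8")

# Python 3 counterpart of unicode(raw, "utf-8"): str with an explicit encoding.
uni_str = str(raw, "utf-8")

# The same conversion using the bytes method.
same = raw.decode("utf-8")
```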

  • But what would be the solution for Python 3?

  • @Viniciusmesel, as the answer says, put the encoding line #*-* coding: utf-8 *-* on the second line of the file, regardless of the Python version; with it you can use special characters normally.

  • Friend, the decoding is not of my code but of the file that my code processes, as the title of the question says!

  • And friend, unicode does not work in Python 3!

  • Can you share your code and the file you need to read?

  • The python processor: https://github.com/vmesel/DataProcessing/blob/master/TwitterProcessing/dumps/proc.py and the file: https://www.dropbox.com/s/9epmg4auqvwaqez/dataset.txt?dl=0

  • There is no way to decode the contents of this file. It seems it was written out carelessly: all the content is formatted as a Python string literal, so you will have to read the file and use replace to substitute the hexadecimal sequences and the quotation marks.

  • Indeed, I had to parse them one by one in Python and process them. Thanks for the help!

  • 1

    Downvoted for not knowing how to work with bytes and text streams and trying to blame the language for it.

  • 1

    By the way, the problem has nothing to do with the encoding declaration of the source file, which is what changes when markers such as #*-* coding: utf-8 *-* are placed at the beginning of the file. That declaration only tells the Python parser the encoding of the program itself, not of the data it handles, and in Python 3 it is utf-8 by default (unlike Python 2, where it is ASCII by default).
