Saving accented words with Python


I have this generated JSON file:

{"certa": 1, "vez": 7, "quando": 13, "tinha": 6, "seis": 7, "anos": 6, "vi": 4, "num": 4, "livro": 3, "sobre": 6, "a": 47, "floresta": 1, "virgem": 1, "hist\u00e3\u00b3rias": 1, "vividas": 1, "uma": 31, "imponente": 1, "gravura": 1, ... }

The above file data is saved as follows:

    with open(nameFileJson + '.json', 'w') as arq:
        json.dump(data, arq)

Here the file name is given by the variable nameFileJson, and data is a string with the text that will be processed to count word frequencies before being added to the JSON file. That is, we end up with a dictionary of words and frequencies. This part works correctly.

I read the JSON file this way:

with open(nomeFile + '.json') as json_data:
    dicContadores = json.load(json_data)
    json_data.close()

return dicContadores

I need the words to keep being saved with their accents. How can I resolve this?
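For reference, a minimal sketch (with a hypothetical word-frequency dict, not the asker's actual data): the \uXXXX escapes produced by the default settings are lossless on the Python side - json.loads recovers the accented words exactly - so the escapes only affect how the raw file looks, not what is read back.

```python
import json

# Minimal round trip with a hypothetical word-frequency dict:
# the default \uXXXX escapes are lossless, json.loads recovers
# the accented words exactly as they were
data = {"histórias": 1, "já": 2}
texto = json.dumps(data)           # '{"hist\\u00f3rias": 1, "j\\u00e1": 2}'
assert json.loads(texto) == data   # accents come back intact
```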

  • Doesn't with open(nameFileJson + '.json', 'w', encoding='utf8') solve it?

  • No, @fernandoavio. I just tested it and it comes out the same.

  • Did you test both writing and reading?

  • Yes, I tested both.

  • "data is a string with the text to be added to the json file" - that's not how json.dump works: it expects a dictionary and will dump the whole dictionary's contents into the file, overwriting anything already there (mode w). Could you edit your question and include a Minimal, Complete and Verifiable Example so we can see exactly what you're doing and what the problem is?

  • How was this JSON generated? Assuming the problematic word is "histórias", the sequence \u00e3\u00b3 does not correspond to the letter "ó". The correct sequence would be \u00c3\u00b3 (assuming you are in UTF-8).

  • Yes, Pedro is right. I didn't explain it correctly.

  • It's in ANSI, @hkotsubo

  • I meant that if the word is "histórias", then "ó" shouldn't be written as \u00e3\u00b3 (equivalent to the bytes e3 b3). I tested converting "ó" to various encodings, and the closest is UTF-8, which produces c3 b3 (not e3 b3). The bytes e3 b3 may represent different characters in other encodings (see), but in none of them do they represent "ó". So the string was already generated incorrectly at the source, and without knowing how that happened we have no way to fix it.

  • Folks - the \uXXXX escape does not use UTF-8 byte values, but the Unicode codepoints directly. In a file that has not undergone incorrect encoding transformations, the letter ó should appear as \u00f3 (with no other sequence). The last two hex digits of these codepoints happen to coincide with the latin1 encoding.

  • The strange example sequence for "ó" happens if the text is encoded as UTF-8 and then treated as if it were latin1: json.dumps("ó".encode("utf-8").decode("latin1")) - output: '"\\u00c3\\u00b3"'

  • @jsbueno In the question the first byte is e3, not c3. If that was a typo by the OP, it's explained. Otherwise, I don't know what it could be...
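The one-liner jsbueno gives above can be run directly to reproduce the mojibake (a quick check using only the standard library):

```python
import json

# "ó" encoded as UTF-8 gives the bytes c3 b3; decoding those bytes
# as latin1 produces the two characters "Ã³", which json then escapes
wrong = "ó".encode("utf-8").decode("latin1")
print(json.dumps(wrong))  # "\u00c3\u00b3" - c3 b3, not the e3 b3 seen in the question
```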


1 answer


Python's json module escapes text as ASCII by default (ensure_ascii=True) - this causes all accented characters to be written as \uXXXX escape sequences.

To make the json module functions write the actual characters instead of these escape sequences, simply pass them the parameter ensure_ascii=False.

That is, in your code, change

json.dump(data, arq)

to:

json.dump(data, arq, ensure_ascii=False)

The text will be saved in UTF-8 encoding. (Remember that, by default, programs in a Windows environment may try to open the text as if it were latin1 - if the accents appear incorrect, the best thing to do is change those programs' configuration to interpret the text as UTF-8, rather than change the JSON file's UTF-8 encoding, which is the standard for this file type.)
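A complete sketch of the fix (the file name and dict here are hypothetical; encoding="utf-8" is passed explicitly so the file bytes don't depend on the platform default):

```python
import json

data = {"histórias": 1, "já": 2}  # hypothetical word-frequency dict

# ensure_ascii=False writes the accented words literally
# instead of as \uXXXX escapes
with open("contadores.json", "w", encoding="utf-8") as arq:
    json.dump(data, arq, ensure_ascii=False)

# Reading back: pass the same encoding
with open("contadores.json", encoding="utf-8") as arq:
    assert json.load(arq) == data  # accents preserved in the file and in memory
```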
