I have a program that generates this JSON file:
{"certa": 1, "vez": 7, "quando": 13, "tinha": 6, "seis": 7, "anos": 6, "vi": 4, "num": 4, "livro": 3, "sobre": 6, "a": 47, "floresta": 1, "virgem": 1, "hist\u00e3\u00b3rias": 1, "vividas": 1, "uma": 31, "imponente": 1, "gravura": 1, ... }
The data above is saved as follows:

with open(nameFileJson + '.json', 'w') as arq:
    json.dump(data, arq)

where the file name is given by the variable nameFileJson, and data is a string with the text that will be processed to count the word frequencies to be added to the JSON file. That is, we end up with a dictionary of words and their frequencies. This part works correctly.
I read the JSON file this way:

with open(nomeFile + '.json') as json_data:
    dicContadores = json.load(json_data)
    json_data.close()
return dicContadores
I need the words to keep their accents when saved. How can I solve this?
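For reference, a minimal sketch of one way to keep the accents readable in the file: passing ensure_ascii=False to json.dump writes the accented characters literally instead of as \uXXXX escapes (assuming UTF-8 is used for both writing and reading; the file name and the sample counts below are made up):

```python
import json

# hypothetical sample data -- a dict of word frequencies
data = {"histórias": 1, "vovó": 2}

with open("contadores.json", "w", encoding="utf-8") as arq:
    # ensure_ascii=False keeps accented characters as-is in the file
    # instead of escaping them as \uXXXX sequences
    json.dump(data, arq, ensure_ascii=False)

with open("contadores.json", encoding="utf-8") as json_data:
    dicContadores = json.load(json_data)

assert dicContadores == data  # the round trip preserves the accents
```

Note that the \uXXXX escapes are themselves valid JSON and load back as the accented strings; ensure_ascii=False only changes how the file looks, not what json.load returns.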
Doesn't

with open(nameFileJson + '.json', 'w', encoding='utf8')

solve it? – fernandosavio
No, @fernandosavio. I just tested it and the result is the same.
– Walt057
Did you test it on both writing and reading?
– fernandosavio
Yes, I tested both.
– Walt057
"data is a string with the text to be added to the json file" - that's not how json.dump works: it expects a dictionary and will dump the entire contents of that dictionary to the file, overwriting anything already written there (mode 'w'). Could you edit your question and include a Minimal, Complete and Verifiable Example, so we can see exactly what you are doing and what the problem is? – Pedro von Hertwig Batista
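A short sketch of Pedro's point (the file name saida.json and the counts are just for illustration): json.dump serializes the whole object it is given, and mode 'w' truncates any previous contents of the file:

```python
import json

contadores = {"certa": 1, "vez": 7}  # hypothetical word counts

with open("saida.json", "w") as arq:  # mode "w" truncates the file first
    json.dump(contadores, arq)        # writes the entire serialized dict

with open("saida.json") as arq:
    conteudo = arq.read()

assert conteudo == '{"certa": 1, "vez": 7}'
```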
How was this JSON generated? Assuming the problematic word is "histórias", the sequence
\u00e3\u00b3
does not correspond to the letter "ó". The correct sequence would be \u00c3\u00b3
(assuming you are in UTF-8). – hkotsubo
Yes, Pedro is right. I didn't explain it correctly.
– Walt057
It's in ANSI, @hkotsubo.
– Walt057
I meant that if the word is "histórias", then "ó" shouldn't be written as
\u00e3\u00b3
(equivalent to the bytes e3 b3). I ran a test converting "ó" to various encodings, and the closest result to that is UTF-8, which produces c3 b3
(note: not e3 b3). The bytes e3 b3
may represent different characters in other encodings (see), but in none of them do they represent "ó". So the string was already generated incorrectly at the source, and without knowing how that happened, we have no way to fix it. – hkotsubo
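hkotsubo's byte analysis can be checked directly in Python (a sketch; cp1252 stands in here for the "ANSI" encoding mentioned above):

```python
# "ó" in UTF-8 is c3 b3 -- not the e3 b3 seen in the question
assert "ó".encode("utf-8") == b"\xc3\xb3"

# in latin-1 and cp1252 ("ANSI") it is a single byte, f3
assert "ó".encode("latin-1") == b"\xf3"
assert "ó".encode("cp1252") == b"\xf3"

# e3 b3 decodes to other characters in latin-1, never to "ó"
assert b"\xe3\xb3".decode("latin-1") == "ã³"
```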
People - the \uXXXX escape does not use UTF-8 codes, but rather the Unicode codepoint directly. The letter ó, in a file that has not gone through any incorrect encoding transformation, should appear simply as "\u00f3" (with no other sequence). For codepoints up to \u00ff, the last two hex digits are equivalent to the latin-1 encoding. – jsbueno
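A quick check of jsbueno's point: the \uXXXX escape carries the Unicode codepoint itself, and for codepoints up to U+00FF those hex digits coincide with the latin-1 byte:

```python
import json

# json escapes "ó" using its codepoint, U+00F3
assert json.dumps("ó") == '"\\u00f3"'
assert ord("ó") == 0x00F3

# for codepoints below 256, the codepoint equals the latin-1 byte value
assert "ó".encode("latin-1") == b"\xf3"
```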
The strange sequence in the example for "ó" happens when the text is encoded to UTF-8 and then treated as if it were latin-1:

json.dumps("ó".encode("utf-8").decode("latin1"))

- output: '"\\u00c3\\u00b3"'
– jsbueno
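For reference, the reverse transform undoes that specific mojibake (a sketch assuming the text really was UTF-8 mis-read as latin-1; it does not account for the e3 byte seen in the question):

```python
import json

# reproduce the mojibake: UTF-8 bytes of "ó" mis-decoded as latin-1
garbled = "ó".encode("utf-8").decode("latin-1")
assert garbled == "Ã³"
assert json.dumps(garbled) == '"\\u00c3\\u00b3"'  # matches jsbueno's output

# re-encoding as latin-1 recovers the UTF-8 bytes, which decode correctly
assert garbled.encode("latin-1").decode("utf-8") == "ó"
```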
@jsbueno In the question the first byte is e3, not c3. If that was a typo by the OP, then it's explained. Otherwise, I don't know what it could be... – hkotsubo