I have a program that generates this JSON file:
{"certa": 1, "vez": 7, "quando": 13, "tinha": 6, "seis": 7, "anos": 6, "vi": 4, "num": 4, "livro": 3, "sobre": 6, "a": 47, "floresta": 1, "virgem": 1, "hist\u00e3\u00b3rias": 1, "vividas": 1, "uma": 31, "imponente": 1, "gravura": 1, ... }
The data above is saved as follows:

with open(nameFileJson + '.json', 'w') as arq:
    json.dump(data, arq)

where the file name is given by the variable nameFileJson, and data is a string with the text that will be processed to count the word frequencies to be added to the JSON file. That is, we end up with a dictionary of words and their frequencies. This part works correctly.
I read the JSON file this way:

with open(nomeFile + '.json') as json_data:
    dicContadores = json.load(json_data)
    json_data.close()
return dicContadores
I need the words to keep their accents when saved. How can I solve this?
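For reference, a minimal sketch of one way to keep the accents readable in the file: passing ensure_ascii=False to json.dump writes the accented characters literally instead of as \uXXXX escapes (assuming UTF-8 is used for both writing and reading; the file name and the sample counts below are made up):

```python
import json

# hypothetical sample data -- a dict of word frequencies
data = {"histórias": 1, "vovó": 2}

with open("contadores.json", "w", encoding="utf-8") as arq:
    # ensure_ascii=False keeps accented characters as-is in the file
    # instead of escaping them as \uXXXX sequences
    json.dump(data, arq, ensure_ascii=False)

with open("contadores.json", encoding="utf-8") as json_data:
    dicContadores = json.load(json_data)

assert dicContadores == data  # the round trip preserves the accents
```

Note that the \uXXXX escapes are themselves valid JSON and load back as the accented strings; ensure_ascii=False only changes how the file looks, not what json.load returns.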
Doesn't

with open(nameFileJson + '.json', 'w', encoding='utf8')

solve it? – fernandosavio
No, @fernandosavio. I just tested it and the result is the same.
– Walt057
Did you test it on both writing and reading?
– fernandosavio
Yes, I tested both.
– Walt057
"data is a string with the text to be added to the json file" - that's not how json.dump works: it expects a dictionary and will dump the entire contents of that dictionary to the file, overwriting anything already written there (mode 'w'). Could you edit your question and include a Minimal, Complete and Verifiable Example, so we can see exactly what you are doing and what the problem is? – Pedro von Hertwig Batista
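A short sketch of Pedro's point (the file name saida.json and the counts are just for illustration): json.dump serializes the whole object it is given, and mode 'w' truncates any previous contents of the file:

```python
import json

contadores = {"certa": 1, "vez": 7}  # hypothetical word counts

with open("saida.json", "w") as arq:  # mode "w" truncates the file first
    json.dump(contadores, arq)        # writes the entire serialized dict

with open("saida.json") as arq:
    conteudo = arq.read()

assert conteudo == '{"certa": 1, "vez": 7}'
```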
How was this JSON generated? Assuming the problematic word is "histórias", the sequence
\u00e3\u00b3
does not correspond to the letter "ó". The correct sequence would be \u00c3\u00b3
(assuming you are in UTF-8). – hkotsubo
Yes, Pedro is right. I didn't explain it correctly.
– Walt057
It's in ANSI, @hkotsubo.
– Walt057
I meant that if the word is "histórias", then "ó" shouldn't be written as
\u00e3\u00b3
(equivalent to the bytes e3 b3). I ran a test converting "ó" to various encodings, and the closest result to that is UTF-8, which produces c3 b3
(note: not e3 b3). The bytes e3 b3
may represent different characters in other encodings (see), but in none of them do they represent "ó". So the string was already generated incorrectly at the source, and without knowing how that happened, we have no way to fix it. – hkotsubo
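hkotsubo's byte analysis can be checked directly in Python (a sketch; cp1252 stands in here for the "ANSI" encoding mentioned above):

```python
# "ó" in UTF-8 is c3 b3 -- not the e3 b3 seen in the question
assert "ó".encode("utf-8") == b"\xc3\xb3"

# in latin-1 and cp1252 ("ANSI") it is a single byte, f3
assert "ó".encode("latin-1") == b"\xf3"
assert "ó".encode("cp1252") == b"\xf3"

# e3 b3 decodes to other characters in latin-1, never to "ó"
assert b"\xe3\xb3".decode("latin-1") == "ã³"
```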
People - the \uXXXX escape does not use UTF-8 codes, but rather the Unicode codepoint directly. The letter ó, in a file that has not gone through any incorrect encoding transformation, should appear simply as "\u00f3" (with no other sequence). For codepoints up to \u00ff, the last two hex digits are equivalent to the latin-1 encoding. – jsbueno
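A quick check of jsbueno's point: the \uXXXX escape carries the Unicode codepoint itself, and for codepoints up to U+00FF those hex digits coincide with the latin-1 byte:

```python
import json

# json escapes "ó" using its codepoint, U+00F3
assert json.dumps("ó") == '"\\u00f3"'
assert ord("ó") == 0x00F3

# for codepoints below 256, the codepoint equals the latin-1 byte value
assert "ó".encode("latin-1") == b"\xf3"
```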
The strange sequence in the example for "ó" happens when the text is encoded to UTF-8 and then treated as if it were latin-1:

json.dumps("ó".encode("utf-8").decode("latin1"))

- output: '"\\u00c3\\u00b3"'
– jsbueno
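For reference, the reverse transform undoes that specific mojibake (a sketch assuming the text really was UTF-8 mis-read as latin-1; it does not account for the e3 byte seen in the question):

```python
import json

# reproduce the mojibake: UTF-8 bytes of "ó" mis-decoded as latin-1
garbled = "ó".encode("utf-8").decode("latin-1")
assert garbled == "Ã³"
assert json.dumps(garbled) == '"\\u00c3\\u00b3"'  # matches jsbueno's output

# re-encoding as latin-1 recovers the UTF-8 bytes, which decode correctly
assert garbled.encode("latin-1").decode("utf-8") == "ó"
```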
@jsbueno In the question the first byte is e3, not c3. If that was a typo by the OP, then it's explained. Otherwise, I don't know what it could be... – hkotsubo