How to save Unicode text in a .json file without escape sequences?

Question

Asked 5 years, 3 months ago

Viewed 138 times

I am writing a program that needs to save a dictionary containing strings with Unicode characters to a JSON file. See the example below:

import json

data = {"face": "( ͡° ͜ʖ ͡°)"}

with open("file.txt", "w", encoding = "utf-8") as file:
    file.write(json.dumps(data, indent = 4))

The problem is that whenever I save the file, all non-ASCII Unicode characters are converted to their \uXXXX escape codes, and I need the file to contain the original text.

In the case of the above example, the contents of the file created by the program are like this:

{
    "face": "( \u0361\u00b0 \u035c\u0296 \u0361\u00b0)"
}

I need the characters to remain the same so that the content is visually pleasing to the user. How can I keep the original text?

What version of Python are you using? Are you using #coding: utf-8 at the beginning of the file?

– Danizavtz

2020/05/18 at 20:53
I’m using Python 3, and the .py and .json files are encoded in UTF-8. But no, I did not put that comment in my code.

– JeanExtreme002

2020/05/18 at 20:54
I tested it by setting ensure_ascii = False in json.dumps and it worked here. Take a look at this link: https://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence

– Erick Kokubum

2020/05/18 at 20:58
Dude, I don’t know if you noticed, but the file you’re reading isn’t the same one you’re writing. I tested your code and it worked, but I had to change the file name to the correct one.

– Danizavtz

2020/05/18 at 21:02
@Danizavtz I know it is not the same one that is being written. Those two really are different files.

– JeanExtreme002

2020/05/18 at 21:06
Thanks @Erickkokubum, your tip worked to resolve the file writing part.

– JeanExtreme002

2020/05/18 at 22:40
The UnicodeDecodeError is probably because other_file.txt was not saved in UTF-8 (I was only able to reproduce the error by generating a file in UTF-16 and trying to read it as UTF-8).

– hkotsubo

2020/05/18 at 23:34
@hkotsubo That can't be it, because I am saving the file in UTF-8. I am using Windows Notepad; is there a problem with it?

– JeanExtreme002

2020/05/19 at 01:03
@hkotsubo I managed to read the file by using the utf-8-sig encoding. I have no idea how it differs from UTF-8, but it was the only encoding that could read the file.

– JeanExtreme002

2020/05/19 at 01:04
So the file is in UTF-8 but was saved with a BOM (Byte Order Mark). The utf-8-sig encoding ignores the BOM: https://docs.python.org/3/howto/unicode.html#Reading-and-writing-Unicode-data

– hkotsubo

2020/05/19 at 01:09
@hkotsubo If I save my file in another editor that does not write this BOM, will reading it with utf-8-sig raise an error, or can I use this encoding with or without a BOM?

– JeanExtreme002

2020/05/19 at 01:13
I don’t remember (but I think so); the only way to find out is to test it :-)

– hkotsubo

2020/05/19 at 01:23
@hkotsubo OK, thanks. I edited the question so that it only covers file creation. I think it is a good question now. Would it be possible to evaluate it?

– JeanExtreme002

2020/05/19 at 01:52
Yes, it has improved now. Before, it had two problems in the same question that were not necessarily related.

– hkotsubo

2020/05/19 at 11:17
Just to add to this, here is an excerpt from the documentation: "Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided."

– hkotsubo

2020/05/19 at 11:38
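
Editor's note: the BOM behaviour described in the comments above can be checked with a small sketch. The file names and the temporary directory are hypothetical, chosen only for this illustration; writing with utf-8-sig stands in for an editor such as Notepad saving the file with a BOM.

```python
import json
import os
import tempfile

# Hypothetical paths, used only for this illustration.
folder = tempfile.mkdtemp()
bom_path = os.path.join(folder, "with_bom.json")
plain_path = os.path.join(folder, "without_bom.json")

content = '{"face": "( ͡° ͜ʖ ͡°)"}'

# Writing with "utf-8-sig" prepends the BOM, like an editor that saves "UTF-8 with BOM".
with open(bom_path, "w", encoding="utf-8-sig") as f:
    f.write(content)

# Writing with plain "utf-8" adds no BOM.
with open(plain_path, "w", encoding="utf-8") as f:
    f.write(content)

# The raw bytes of the first file start with 0xef, 0xbb, 0xbf.
with open(bom_path, "rb") as f:
    has_bom = f.read().startswith(b"\xef\xbb\xbf")

# Decoding as plain "utf-8" keeps the BOM as a leading U+FEFF character.
with open(bom_path, encoding="utf-8") as f:
    text_utf8 = f.read()

# Decoding as "utf-8-sig" skips the BOM when present and is harmless when absent.
with open(bom_path, encoding="utf-8-sig") as f:
    data_with_bom = json.load(f)
with open(plain_path, encoding="utf-8-sig") as f:
    data_without_bom = json.load(f)
```

Under these assumptions, utf-8-sig reads both files and yields the same dictionary, so it can be used for reading whether or not the editor wrote a BOM.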
And finally, if you’re reading/writing files, you don’t have to call read and write; just do data = json.load(file) and json.dump(data, file, indent=4, ensure_ascii=False). The functions are load and dump, without the "s" at the end; the versions with "s", loads and dumps, work directly with strings, although the final result should be the same.

– hkotsubo

2020/05/19 at 12:11
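
Editor's note: a minimal sketch of the json.dump / json.load suggestion from the last comment, applied to the dictionary from the question. The temporary directory and the file name file.json are assumptions made for this illustration.

```python
import json
import os
import tempfile

data = {"face": "( ͡° ͜ʖ ͡°)"}

# Hypothetical location, used only so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "file.json")

# json.dump writes straight to the file object; ensure_ascii=False keeps
# the Unicode characters instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

# json.load reads straight from the file object.
with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

# Read the raw text back to inspect what was actually written.
with open(path, encoding="utf-8") as file:
    text = file.read()
```

Passing encoding="utf-8" explicitly on both open calls matters here, because the platform default encoding may not be able to represent these characters.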

1 answer

by JeanExtreme002 • **5,663** points · Answer 1 · 2020-05-19T02:12:05+00:00

As Erick said in the comments, this conversion happens in the json.dumps function because of the ensure_ascii parameter, which is True by default. This parameter guarantees that the output contains only ASCII characters, so every non-ASCII character is written as an escape sequence.

Therefore, set the value of the parameter to False as in the code below so that it keeps the Unicode characters without converting them:

import json

data = {"face": "( ͡° ͜ʖ ͡°)"}
content = json.dumps(data, ensure_ascii = False)

print(content)  # {"face": "( ͡° ͜ʖ ͡°)"}
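
As a rough comparison of the two settings, the sketch below serializes the same dictionary with and without ensure_ascii and checks that both strings decode to the same data. The variable names are only illustrative, and str.isascii assumes Python 3.7 or later.

```python
import json

data = {"face": "( ͡° ͜ʖ ͡°)"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(data, ensure_ascii=False)

# The escaped form is pure ASCII; the readable form is not.
escaped_is_ascii = escaped.isascii()
readable_is_ascii = readable.isascii()

# Both forms are valid JSON and decode to the same dictionary.
same_data = json.loads(escaped) == json.loads(readable) == data
```

So the choice only affects how the text looks in the file, not the data a JSON parser gets back.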