How to save Unicode text in a .json file without escape sequences?

I am writing a program that needs to save a dictionary containing strings with Unicode characters to a JSON file. See the example below:

import json

data = {"face": "( ͡° ͜ʖ ͡°)"}

with open("file.txt", "w", encoding = "utf-8") as file:
    file.write(json.dumps(data, indent = 4))

The problem is that whenever I save the file, all non-ASCII Unicode characters are converted to their respective \uXXXX escape codes, and I need the file to contain the original text.

In the case of the above example, the contents of the file created by the program are like this:

{
    "face": "( \u0361\u00b0 \u035c\u0296 \u0361\u00b0)"
}

I need the characters to remain the same so that the content is visually pleasing to the user. How can I keep the original text?

  • What version of Python are you using? Do you have #coding: utf-8 at the beginning of the file?

  • I’m using Python 3, and the .py and .json files are encoded in UTF-8. But no, I did not add that comment to my code.

  • I tested it by setting ensure_ascii = False in json.dumps and it worked here. Take a look at this link: https://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence

  • Dude, I don’t know if you noticed, but the file you’re reading isn’t the same one you’re writing. I tested your code and it worked, but I had to change the file name to the correct one.

  • @Danizavtz I know it is not the same file that is being written. Those two really are different files.

  • Thanks @Erickkokubum, your tip solved the file-writing part.

  • The UnicodeDecodeError is probably because other_file.txt was not saved in UTF-8 (I was only able to reproduce the error by generating a file in UTF-16 and trying to read it as UTF-8).

  • @hkotsubo That is not possible, because I am saving the file in UTF-8. I am using Windows Notepad; is there a problem with it?

  • @hkotsubo I managed to read the file by opening it with the utf-8-sig encoding. I have no idea how it differs from UTF-8, but that was the only encoding that could read the file.

  • So the file is in UTF-8 but was saved with a BOM (Byte Order Mark); the utf-8-sig encoding skips the BOM: https://docs.python.org/3/howto/unicode.html#Reading-and-writing-Unicode-data

  • @hkotsubo If I save my file in another editor that does not write this BOM, will reading it with utf-8-sig raise an error, or can I use this encoding with or without a BOM?

  • I don’t remember (but I think so); just test it to find out :-)

  • @hkotsubo OK, thanks. I edited the question so that it covers only the file-creation issue. I think it is a good question now. Would it be possible to re-evaluate it?

  • Yes, it is better now. Before, it had two not necessarily related problems in the same question.

  • Just to add, here is an excerpt from the documentation: "Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided."

  • And finally, if you’re reading/writing files, you don’t have to call read and write yourself; just do data = json.load(file) and json.dump(data, file, indent=4, ensure_ascii=False). The functions are load and dump, without the "s" at the end; the versions with "s" (loads and dumps) work directly with strings, although it should make no difference in the final result.
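On the utf-8-sig question raised in the comments above, a quick test along the lines suggested could look like the sketch below. It is only an illustration: the file names are hypothetical, and the BOM is written by hand with codecs.BOM_UTF8 to imitate a file saved by Notepad. It reads one file with a BOM and one without, both with encoding="utf-8-sig".

```python
import codecs
import os
import tempfile

text = '{"face": "( ͡° ͜ʖ ͡°)"}'

# A temporary directory keeps the sketch self-contained
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.json")        # hypothetical name
    without_bom = os.path.join(tmp, "without_bom.json")  # hypothetical name

    # Imitate Notepad: the three BOM bytes followed by the UTF-8 data
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8"))

    # Imitate an editor that writes plain UTF-8 without a BOM
    with open(without_bom, "wb") as f:
        f.write(text.encode("utf-8"))

    # utf-8-sig skips the BOM when it is present and is harmless when it is not
    with open(with_bom, encoding="utf-8-sig") as f:
        read_with_bom = f.read()
    with open(without_bom, encoding="utf-8-sig") as f:
        read_without_bom = f.read()

    # Plain utf-8 keeps the BOM as a leading U+FEFF character
    with open(with_bom, encoding="utf-8") as f:
        read_as_plain_utf8 = f.read()

print(read_with_bom == text)                      # True
print(read_without_bom == text)                   # True
print(read_as_plain_utf8.startswith("\ufeff"))    # True
```

So, for reading, utf-8-sig should be usable whether or not the editor wrote a BOM. For writing, note that utf-8-sig adds a BOM, so plain utf-8 is probably the better choice there.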

1 answer

As Erick said in the comments, this conversion happens in the json.dumps function because of the ensure_ascii parameter, which defaults to True. With that setting, every non-ASCII character in the output is escaped, so the result contains only ASCII characters.

Therefore, set the value of the parameter to False as in the code below so that it keeps the Unicode characters without converting them:

import json

data = {"face": "( ͡° ͜ʖ ͡°)"}
content = json.dumps(data, ensure_ascii = False)

print(content) # {"face": "( ͡° ͜ʖ ͡°)"}
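Applied to the file-writing case from the question, the same parameter can be passed to json.dump, which writes straight to the file object. The following is a minimal sketch rather than a definitive version: the temporary directory and the file name file.json are assumptions made only to keep the example self-contained.

```python
import json
import os
import tempfile

data = {"face": "( ͡° ͜ʖ ͡°)"}

# Temporary directory used only so the sketch cleans up after itself
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.json")  # hypothetical file name

    # Write the JSON directly to the file, keeping non-ASCII characters as-is
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)

    # Read the raw text back to see what was actually stored
    with open(path, encoding="utf-8") as file:
        saved_text = file.read()

    # Parse the file again to confirm the round trip
    with open(path, encoding="utf-8") as file:
        loaded = json.load(file)

# For comparison: the default (ensure_ascii=True) escapes the same characters
escaped = json.dumps(data)

print(saved_text)
print(loaded == data)               # True
print(json.loads(escaped) == data)  # True
```

Both forms decode to the same dictionary, so the choice only affects how the file looks to someone opening it. Passing encoding="utf-8" to open matters here, since the platform default encoding may not be able to represent these characters.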
