Fix badly formatted JSON

I receive totally non-standard JSON from a customer. And, as always, the customer is right :-(

The JSON arrives as follows:

{ dominio:casadasloucas.com.br, wp-admin:casadasloucas.com.br/wp-admin, wp-user:fulano, wp-pass: 1234 }

I’m applying a regular expression that I found on the web (because I suck at regex), which is the following:

    jsonStr = re.sub("((?=\D)\w+):", r'"\1":',  jsonStr)
    jsonStr = re.sub(": ((?=\D)\w+)", r':"\1"',  jsonStr)

But it ended up doing more harm than good. After applying the regex above, the JSON looked like this:

{ "dominio":"casadasloucas".com.br, wp-"admin":"casadasloucas".com.br/wp-admin, wp-"user":"fulano", wp-"pass": 1234 }

Can someone help me with this regular expression so that it puts the quotes correctly?

  • I recommend reading this post: https://stackoverflow.com/questions/2583472/regex-to-validate-json

  • Could you explain what you’re trying to do? I checked online, and the json provided looks like a valid json, checked on this site: https://jsonformatter.curiousconcept.com/

  • @Danizavtz It is not valid JSON, because keys and string values must be in quotes. The site you tested corrects the JSON (by adding the quotes), but if you uncheck the "Fix JSON" option, it is reported as invalid. Testing on other sites (such as https://jsonlint.com) and in Python itself with the json module, we can see that this data is in fact not valid JSON. Anyway, what the question is trying to do is put the quotation marks in the right places so that the JSON becomes valid.

1 answer

The ideal solution is to fix the JSON at the source. Whoever sends the data must make sure it is in the correct format. Period. Any other solution (even more so with regex) will be non-ideal (also known as a "patch" or "gambiarra").

What you’re getting is not JSON (it’s something that looks like JSON but doesn’t follow the correct syntax), and in this case the customer is not right (though I understand the "pressure" involved).

Still, I would try to argue that sending invalid "JSON" is counterproductive for everyone, including the client himself: a non-standard format costs whoever reads it more development time and can cause errors that would not occur if the JSON were correct. With the correct format, everyone wins. But anyway, if he does not want to fix it, he should at least be aware that he also suffers the consequences...

Another point is that regex is not the ideal tool for handling JSON (although it is possible, it is not the best option). Even if the data is very simple and the regex "works", any such solution will be very prone to errors and unforeseen situations, and in the end it can be more complicated than simply asking the client to send valid JSON (not to mention the additional overhead of using regex, which, depending on the volume of data, can cause a performance problem).


That said, a very naive solution would be:

    import re

    jsonStr = "{ dominio:casadasloucas.com.br, wp-admin:casadasloucas.com.br/wp-admin, wp-user:fulano, wp-pass: 1234 }"
    jsonStr = re.sub(r'([^ :]+)\s*:\s*([^ ,]+)', r'"\1": "\2"', jsonStr)

The regex assumes that a key name is "anything other than spaces or colons": [^ :]+.

And a value can be "anything other than spaces or commas": [^ ,]+.

In other words, I’m relying on what appears to be the rule behind the data sent (since no more details were given, just an example). As there are no quotes in the original data, I assume there can be no spaces or commas in the values, because then the structure would be ambiguous and much more difficult to parse (another argument in favor of sending valid JSON).
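To illustrate that ambiguity, here is a small sketch that runs the same pattern over a made-up input whose first value contains spaces. The input string and the variable names (`raw`, `quoted`, `is_valid`) are assumptions for the demonstration, not data from the question:

```python
import json
import re

# Same pattern as in the answer; the input is hypothetical and contains
# a value with spaces ("Casa das Loucas").
PATTERN = r'([^ :]+)\s*:\s*([^ ,]+)'
raw = "{ nome:Casa das Loucas, wp-user:fulano }"

quoted = re.sub(PATTERN, r'"\1": "\2"', raw)
print(quoted)  # { "nome": "Casa" das Loucas, "wp-user": "fulano" }

# The result is still not valid JSON, so json.loads rejects it.
try:
    json.loads(quoted)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)  # False
```

Only the first word of the value is quoted, and nothing in the unquoted format says where the value really ends.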

With this, the string becomes valid JSON:

{ "dominio": "casadasloucas.com.br", "wp-admin": "casadasloucas.com.br/wp-admin", "wp-user": "fulano", "wp-pass": "1234" }
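A quick way to confirm that the result is valid is to feed it to the standard json module, which raises json.JSONDecodeError on invalid input. A minimal sketch, using the sample string from the question (the names `raw`, `fixed` and `data` are just for illustration):

```python
import json
import re

raw = "{ dominio:casadasloucas.com.br, wp-admin:casadasloucas.com.br/wp-admin, wp-user:fulano, wp-pass: 1234 }"
fixed = re.sub(r'([^ :]+)\s*:\s*([^ ,]+)', r'"\1": "\2"', raw)

data = json.loads(fixed)  # would raise json.JSONDecodeError if still invalid
print(data["wp-admin"])                # casadasloucas.com.br/wp-admin
print(type(data["wp-pass"]).__name__)  # str: every value is a string here
```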

If you want the numbers not to have quotes, you can use a replacement function:

    import re

    def coloca_aspas(m):
        result = f'"{m.group(1)}": '  # the key name always goes in quotes
        valor = m.group(2)
        try:
            int(valor)  # check whether the value is a number
            result += valor  # if it is, do not add quotes
        except ValueError:  # not a number, so wrap it in quotes
            result += f'"{valor}"'
        return result

    # jsonStr here is the original, unquoted string
    jsonStr = re.sub(r'([^ :]+)\s*:\s*([^ ,]+)', coloca_aspas, jsonStr)

With that, the JSON looks like this:

{ "dominio": "casadasloucas.com.br", "wp-admin": "casadasloucas.com.br/wp-admin", "wp-user": "fulano", "wp-pass": 1234 }
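One caveat with the int() check: it accepts strings such as 0123 or +5, which are not valid JSON numbers, so copying them unquoted would produce invalid output again. A possible stricter variant is sketched below, assuming that only plain integers should stay unquoted. The names `JSON_INT` and `quote_pair` and the sample input are hypothetical:

```python
import json
import re

# Hypothetical helper: only integers written in JSON's own syntax
# (optional minus sign, no leading zeros) are left unquoted.
JSON_INT = re.compile(r'-?(?:0|[1-9][0-9]*)')

def quote_pair(m):
    key, value = m.group(1), m.group(2)
    if JSON_INT.fullmatch(value):
        return f'"{key}": {value}'    # valid JSON integer, keep as a number
    return f'"{key}": "{value}"'      # anything else becomes a string

# Made-up input: 0123 must stay a string, 8080 can be a number.
raw = "{ wp-user:fulano, wp-pass: 0123, port: 8080 }"
fixed = re.sub(r'([^ :]+)\s*:\s*([^ ,]+)', quote_pair, raw)
data = json.loads(fixed)
print(data)  # {'wp-user': 'fulano', 'wp-pass': '0123', 'port': 8080}
```

Keeping 0123 as a string also preserves the leading zero, which matters for something like a password.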

Your regex did not work because the shorthand \w matches only letters, digits, and the character _, so it skipped characters such as the dot, slash, and hyphen. That’s why you didn’t get the expected result.

And the lookahead (?=\D) only checks that the next character is not a digit, which does little to help put the quotation marks in the right places.
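The behavior of \w can be checked directly in the interpreter. A small sketch (the variable names are only for illustration):

```python
import re

# \w+ stops at the first character that is not a letter, digit, or underscore.
key_parts = re.findall(r'\w+', 'wp-admin')
value_parts = re.findall(r'\w+', 'casadasloucas.com.br')
print(key_parts)    # ['wp', 'admin']
print(value_parts)  # ['casadasloucas', 'com', 'br']

# So the key pattern from the question captures only the part after the hyphen.
key_match = re.search(r'((?=\D)\w+):', 'wp-admin:casadasloucas.com.br')
print(key_match.group(1))  # admin
```

This matches the broken output in the question, where the quotes landed around admin instead of wp-admin.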


Do not use regex

But as I said, this solution is quite naive. I don’t know whether all possible values are exactly the way I defined them (can there be values with quotation marks, colons, spaces, or braces?), nor whether there are more complicated cases that would require changing the regex (I didn’t test with nested objects and arrays, for example). If so, you would probably have to build your own custom parser, since the json module cannot read formats that look like JSON but are not.
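As a rough idea of what such a custom parser might look like, here is a minimal sketch for the flat case only. It assumes a single pair of braces, pairs separated by commas, and no commas inside values; the function name `parse_flat` is hypothetical, and nested objects or arrays are not handled:

```python
import json

def parse_flat(text):
    """Parse '{ key:value, key:value }' into a dict (flat objects only)."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected the text to be wrapped in braces")
    result = {}
    for pair in body[1:-1].split(","):
        if not pair.strip():
            continue  # tolerate an empty object or a trailing comma
        key, sep, value = pair.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"missing colon in {pair!r}")
        result[key.strip()] = value.strip()
    return result

raw = "{ dominio:casadasloucas.com.br, wp-admin:casadasloucas.com.br/wp-admin, wp-user:fulano, wp-pass: 1234 }"
data = parse_flat(raw)
print(json.dumps(data))  # json.dumps takes care of quoting and escaping
```

Splitting on the first colon only means a value that itself contains a colon is kept whole, and building a dict first leaves the quoting to json.dumps instead of to string substitution. Every value comes out as a string in this sketch.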

  • I agree with you, but the client does not agree with us :-) Your RE almost worked... when I applied, the conversion was like this: "{ domain": "casadasloucas.com.br"", wp-admin": "casadasloucas.com.br/wp-admin", wp-user": "so-and-so", wp-pass": "1234", cloudflare": "yes } " I will test your function... Thank you!
