Processing JSON data with regex in JavaScript

I have a problem in a project with a PHP backend and a front end in HTML, JavaScript and CSS.

The backend sends data to the front end as JSON. However, some of the text values contain quotes, for example:

{
   "descrição": "Eu faço trabalhos "fáceis" porém cansativos"
}

JSON treats the quotes around "fáceis" as the end of the value, and I don’t want that.

I could do:

myJson.replace(/"/g, '&quot;');

But that would replace every quote in the JSON, producing output like this:

{
   &quot;descrição&quot;: &quot;Eu faço trabalhos &quot;fáceis&quot; porém cansativos&quot;
}

Which breaks everything.

I’m looking to write a regex that replaces the quotation marks with &quot; only when they are inside a JSON value. Or is there some other way to handle this case?

  • 2

    If the JSON arrives quoted exactly like this, then the error is in the backend, which is generating invalid JSON, and that’s where it should be fixed. Fixing it with regex is not simple (do not use regex to manipulate JSON: for valid JSON it’s hard enough, for invalid JSON it’s even worse). You will probably have to manipulate the string manually, since parsers usually raise an error when the JSON is invalid (or fix it where it is generated, which is the recommended approach).

  • Saul, it would help if you added to the question the code that generates the JSON, or at least said which language you are using to generate it.

1 answer


As I said in the comments, if the JSON arrives with the quotes exactly like that, then it is not valid JSON. In that case, it is best to fix the backend so that it generates the JSON correctly, in this case with the quotes escaped (\"):

{
    "descrição": "Eu faço trabalhos \"fáceis\" porém cansativos"
}
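
A quick way to see the difference between the two forms is to hand each one to JSON.parse. This is only a minimal sketch: tryParse is a hypothetical helper name, and String.raw is used so the backslashes reach the parser exactly as they would arrive over the wire.

```javascript
// Raw text as the front end would receive it (String.raw keeps the backslashes).
const broken = String.raw`{ "descrição": "Eu faço trabalhos "fáceis" porém cansativos" }`;
const fixed = String.raw`{ "descrição": "Eu faço trabalhos \"fáceis\" porém cansativos" }`;

// Hypothetical helper: wraps JSON.parse so a failure becomes a value instead of an exception.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

const brokenResult = tryParse(broken);
const fixedResult = tryParse(fixed);

console.log(brokenResult.ok);               // false: the unescaped quote ends the string early
console.log(fixedResult.value["descrição"]); // Eu faço trabalhos "fáceis" porém cansativos
```

The escaped version parses, and the quotes come back as ordinary characters inside the value.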

Treating the problem at source ensures that whoever is receiving this data will not need to worry about tidying it up, because it is not such a simple task.

Since you said the backend is PHP, make sure you are using the json_encode function correctly (it is the simplest way to generate JSON in PHP). Or, if you are using another engine/API/framework, check that everything is configured correctly and that the parameters are right, because the problem is most likely there.
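
As an analogy only (this is JavaScript, not the PHP backend from the question), the sketch below shows what a standard encoder does with embedded quotes. JSON.stringify plays the role here that json_encode would play in PHP; the payload object is made up for illustration.

```javascript
// Hypothetical payload: the value contains literal double quotes.
const payload = { "descrição": 'Eu faço trabalhos "fáceis" porém cansativos' };

// The encoder escapes the inner quotes on its own.
const wire = JSON.stringify(payload);
console.log(wire);
// {"descrição":"Eu faço trabalhos \"fáceis\" porém cansativos"}

// Parsing the encoded text gives back the original value, quotes included.
const roundTrip = JSON.parse(wire);
console.log(roundTrip["descrição"] === payload["descrição"]); // true
```

When the text is produced by an encoder like this rather than by string concatenation, the receiver has nothing to repair.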

The rest of the answer below is only to show how using regex can be a bad, more complicated and even unnecessary solution if you fix the problem at source.


See for example how a regex would look for your case (will not work on all browsers):

let s = `{
   "descrição": "Eu faço trabalhos "fáceis" porém cansativos",
   "teste": "Aqui não tem aspas a mais",
   "teste2": "Aqui " tem várias " aspas a mais""
}`;
let r = /(?<!"\s*:\s*)"(?![\n\r,:]|[^"]+":)/g;
console.log(s.replace(r, '&quot;'));

Basically, it matches the quotes while taking several factors into account:

The negative lookbehind (?<!"\s*:\s*) checks that something does not exist before the quotation mark. In this case, that something is a quotation mark followed by \s* (zero or more whitespace characters), a colon, and zero or more whitespace characters. That is, the match cannot be the first quote right after the :.

Note: at the time of writing, lookbehind only works in Chrome. But even if you use another language (other than JavaScript) that supports this feature, it is still worth reading the rest of the answer.
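
If the code has to run where lookbehind support is uncertain, one option is to feature-detect it at runtime. The sketch below is an assumption about how that could be done, with supportsLookbehind as a hypothetical helper name. It builds the pattern through the RegExp constructor because a regex literal with unsupported syntax would fail when the script is parsed, before any try/catch could run.

```javascript
// Hypothetical helper: returns true when the engine accepts lookbehind syntax.
function supportsLookbehind() {
  try {
    // Engines without lookbehind throw a SyntaxError when compiling this pattern.
    new RegExp('(?<!a)b');
    return true;
  } catch (err) {
    return false;
  }
}

const lookbehindAvailable = supportsLookbehind();
console.log(lookbehindAvailable);
```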

The negative lookahead (?![\n\r,:]|[^"]+":) checks that something does not exist after the quotation mark. In this case, it is [\n\r,:] (a line break, a comma, or a colon), so the closing quotes are not matched. The | (which means "or") adds another possibility: [^"]+": is one or more characters other than quotation marks, followed by a quotation mark and a colon (without this, the regex would also match the first quotation mark of each line).

Basically, all these rules are to disregard legitimate opening and closing quotes. But this regex does not cover all cases.


For example, if we have an array, it no longer works:

let s = `{
   "lista": [ "Eu faço trabalhos "fáceis" porém cansativos" ]
}`;
let r = /(?<!"\s*:\s*)"(?![\n\r,:]|[^"]+":)/g;
console.log(s.replace(r, '&quot;'));

In this case, it replaces every quote inside the array. So we need to add more conditions to the regex to cover this case. For example, I could tell it to ignore a quote right after [ or a comma, or right before ]:

let s = `{
   "lista": [ "Eu faço trabalhos "fáceis" porém cansativos" ]
}`;
let r = /(?<!"\s*:\s*|[\[,]\s*)"(?![\n\r,:]|[^"]+":|\s*\])/g;
console.log(s.replace(r, '&quot;'));

But there are still cases where it can fail. For example, if the string itself contains some of these characters, such as : or []:

let s = `{
   "fad": "fasdfa "fa" : "fasdfasd"sdfa",
   "xyz": [ "af [ "xyz " ad", " fasd "fasf" dfs"]
}`;
let r = /(?<!"\s*:\s*|[\[,]\s*)"(?![\n\r,:]|[^"]+":|\s*\])/g;
console.log(s.replace(r, '&quot;'));

Now, since the strings themselves contain the characters : and [] (which I had used as reference points to tell whether I’m at the beginning or end of a string), the regex gets lost, because it does not track whether it is inside a string or not.
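
One way to check each attempt above is to apply the replacement and pass the result to JSON.parse. The sketch below reuses the two regexes and the three sample inputs from this answer; parsesAfterFix and the variable names are hypothetical, and it needs an engine with lookbehind support. A successful parse only means the repaired text is well-formed JSON; the values will then contain the literal text &quot;.

```javascript
// The two regexes from the answer, unchanged.
const r1 = /(?<!"\s*:\s*)"(?![\n\r,:]|[^"]+":)/g;
const r2 = /(?<!"\s*:\s*|[\[,]\s*)"(?![\n\r,:]|[^"]+":|\s*\])/g;

// The three sample inputs from the answer.
const simple = `{
   "descrição": "Eu faço trabalhos "fáceis" porém cansativos",
   "teste": "Aqui não tem aspas a mais",
   "teste2": "Aqui " tem várias " aspas a mais""
}`;
const withArray = `{
   "lista": [ "Eu faço trabalhos "fáceis" porém cansativos" ]
}`;
const tricky = `{
   "fad": "fasdfa "fa" : "fasdfasd"sdfa",
   "xyz": [ "af [ "xyz " ad", " fasd "fasf" dfs"]
}`;

// Hypothetical helper: replace the matched quotes, then see if the result parses.
function parsesAfterFix(text, regex) {
  const repaired = text.replace(regex, '&quot;');
  try {
    JSON.parse(repaired);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(parsesAfterFix(simple, r1));    // true
console.log(parsesAfterFix(withArray, r1)); // false: the array's own quotes were replaced
console.log(parsesAfterFix(withArray, r2)); // true
console.log(parsesAfterFix(tricky, r2));    // false: ':' inside a string confuses the regex
```

Each new regex fixes the previous failure and then breaks on the next, slightly more complicated input.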

I believe it is even possible to keep going and handle this case too, but I think the regex is already complicated enough and it is not worth it anymore.


All this is to show that it is probably not worth trying to fix JSON with regex. Try to fix the JSON where it is generated instead of creating a bigger problem by trying to solve it with regex.

  • The answer is very good, but I think the question still did not deserve an answer, because as it stands the OP will use regex to solve a problem that should be fixed in the backend (even though you warned against it)...

  • @fernandosavio My intention was to show that the regex solution is complicated and not worth it, but what you said makes sense. Still, following the recommendation or not is the OP’s decision, because the warning was given... And I think a slightly more complicated JSON is enough to realize that it really isn’t worth it...

  • 1

    Exactly, the answer is good, my fear is that it will be used for the dark side of the Force. :P Just in case, I’ll ask the OP for the code that generates the JSON; it may keep some good soul from using regex for this in the future.

  • @fernandosavio Yeah, I forgot to ask that... But if the backend is PHP, I think he must not be using json_encode, because I can’t see a situation where it would generate invalid JSON like this... Anyway, I added a few more warnings at the beginning of the answer.

  • 1

    Man, I doubt that a native library in any backend language would generate JSON with such a basic error... Anyway, if the OP is interested in a solution to his real problem, it’s up to him now. :D Take my +1
