Find character succeeded by another character using Regex


4

I am trying to write a regular expression that can identify and correct invalid JSON. Specifically, taking the following JSON as an example:

{
    "array": [{
        "id": "123",
        "anotherObject": {
            "name": "something"
        },
        "address": {
            "street": "Dreamland",
            "anotherArray": [
                [3, 3]
                [3, 3]
            ]
        }
    }]
}

In this case, the value of the key anotherArray is invalid because a , is missing between the first and the second array. I wonder whether it is possible to create a regex that identifies when a ] is followed by a [ and adds a comma between them using re.sub(), so that the end result is this:

{
    "array": [{
        "id": "123",
        "anotherObject": {
            "name": "something"
        },
        "address": {
            "street": "Dreamland",
            "anotherArray": [
                [3, 3],
                [3, 3]
            ]
        }
    }]
}

The best I could come up with was (?<=]), but it matches every closing bracket, not only those followed by an opening bracket.

  • 2

    Using a single regex, in the way the accepted answer describes, seems like a mistake. If you eventually receive JSON that is valid but contains something like "street": "[foo] [bar]", this will fail, because the regex will produce "street": "[foo], [bar]", which changes the string. That may seem minor to you, but it is just one example; this kind of thing can cause many other problems that you cannot predict. To solve this you would probably need to write your own parser, in which you would identify "strings" and "keys", [...]

  • 2

    [...] but I warn you, it will not be easy. You will need much more than a regex, plus a lot of logic so the script knows how to act in each possible situation. As an example of a similar problem I ran into: in my PHP framework I built a CSS-style DOM selector, and a selector like [foo="abc[def]"] failed because of the inner [ and ]. I had to isolate everything so that whatever was inside [] was resolved first, and only then the rest. Other PHP selector libraries such as phpquery fail because their developers did not anticipate such problems.

4 answers

6

This answer is posted only to share knowledge about the exception that is raised and its attributes; using this code in production is not recommended. The definitive and safe solution will always be to fix the problem at its source, where the JSON is generated.

To work around missing commas in the JSON, and nothing else, you can inspect the exception raised by json.loads and check its error message.

When the JSON is invalid, json.loads raises json.decoder.JSONDecodeError, which carries information that may be useful depending on the context. The attribute msg holds the error message, doc holds the JSON document being parsed, and pos holds the position in that document where the error occurred. Since we only handle the missing comma, we can insert the comma at that position and make a recursive call with the corrected JSON.

import json

def json_loads_with_missing_commas(data):
    try:
        return json.loads(data)
    except json.decoder.JSONDecodeError as error:
        if error.msg == "Expecting ',' delimiter":
            # Build a new JSON string with the comma inserted where the error occurred
            data = error.doc[:error.pos] + ',' + error.doc[error.pos:]
            return json_loads_with_missing_commas(data)
        raise error
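
To make the attributes described above concrete, here is a minimal sketch that triggers the error on a small, made-up document and prints what the exception carries. The sample string and the variable names are assumptions chosen for illustration; msg, doc, pos, lineno and colno are attributes of json.JSONDecodeError.

```python
import json

# Hypothetical sample: the comma between 1 and 2 is missing
sample = '[1 2]'

try:
    json.loads(sample)
except json.JSONDecodeError as error:
    caught = error  # keep a reference; "error" is cleared after the except block
    print(error.msg)                  # the bare message, without position details
    print(error.pos)                  # index into error.doc where parsing failed
    print(error.lineno, error.colno)  # the same position as line and column
    # Insert the comma exactly where the parser gave up
    repaired = error.doc[:error.pos] + ',' + error.doc[error.pos:]
    print(repaired)
```

The position reported is that of the first unexpected character after any whitespace, so the comma ends up immediately before the next value rather than right after the previous one.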

Taking the example JSON and removing all of its commas, we would have:

{
    "array": [{
        "id": "123"
        "anotherObject": {
            "name": "something"
        }
        "address": {
            "street": "Dreamland"
            "anotherArray": [
                [3 3]
                [3 3]
            ]
        }
    }]
}

Calling json_loads_with_missing_commas(data) on the JSON above gives the following result (shown here as formatted JSON):

{
    "array": [
        {
            "id": "123",
            "anotherObject": {
                "name": "something"
            },
            "address": {
                "street": "Dreamland",
                "anotherArray": [
                    [
                        3,
                        3
                    ],
                    [
                        3,
                        3
                    ]
                ]
            }
        }
    ]
}

Any other error in JSON would be propagated without changes.
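
Since every missing comma costs one recursive call, a document with a very large number of missing commas could in principle run into Python's recursion limit. A loop-based variant of the same idea is sketched below. The name loads_inserting_commas and the max_fixes guard are assumptions added for illustration, not part of the code above.

```python
import json

def loads_inserting_commas(data, max_fixes=1000):
    """Loop-based sketch: insert a comma wherever json.loads reports one missing."""
    for _ in range(max_fixes):
        try:
            return json.loads(data)
        except json.JSONDecodeError as error:
            if error.msg != "Expecting ',' delimiter":
                raise  # any other syntax error propagates unchanged
            data = error.doc[:error.pos] + ',' + error.doc[error.pos:]
    raise ValueError("gave up after inserting max_fixes commas")

# Hypothetical input with two missing commas
broken = '{"a": [1 2] "b": "x"}'
print(loads_inserting_commas(broken))

# A different kind of error is not repaired
try:
    loads_inserting_commas('{"a": }')
except json.JSONDecodeError as error:
    print("propagated:", error.msg)
```

The same warning applies as for the recursive version: this is meant to illustrate the exception handling, not to replace fixing the generator of the JSON.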

  • 1

    I have not tested it, but it looks like the ideal approach. Congratulations, this solution takes the problems into account. As soon as I have time to test it properly, I will put up a bounty. PS: the opening of the answer already states the right resolution; personally I would have criticized the indiscriminate use of regex, but I believe that was implicit.

  • 1

    Excellent strategy!

5


It is possible, I just do not know whether it is worth it. Depending on how varied your data is, a regex that covers every case can get too complicated. The ideal is to correct the data at the source, so that it always generates valid JSON.

That said, for this specific case, you could do something like this:

invalido = """
{
    "array": [{
        "id": "123",
        "anotherObject": {
            "name": "something"
        },
        "address": {
            "street": "Dreamland",
            "anotherArray": [
                [3, 3]
                [3, 3]
            ]
        }
    }]
}
"""

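import re

# Replace "]", one or more non-comma characters, and "[" with "], ["
valido = re.sub(r'\][^,]+\[', r'], [', invalido)

import json

# With the comma in place, the document parses without error
dados = json.loads(valido)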

The idea of the regex is to match a ], followed by [^,]+ (one or more characters other than a comma), followed by a [.

In the replacement, I swap all of that for ], [ (that is, I put a comma between them).

This works for this case, but if the JSON contains, for example:

{ "chave": "[valor1]  [valor2]" }

The regex cannot detect that the sequence ] [ is part of a string and should not be replaced. Detecting such cases makes the regex too complicated, and it may not be worth it.
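
The effect can be seen by running the same substitution on that valid document. This is a small sketch; the variable names original and changed are made up for illustration.

```python
import json
import re

# Valid JSON whose string value happens to contain "]  ["
original = '{ "chave": "[valor1]  [valor2]" }'

# Same substitution as above: it cannot tell the string apart from structure
changed = re.sub(r'\][^,]+\[', r'], [', original)

print(json.loads(original)["chave"])
print(json.loads(changed)["chave"])
```

Both documents still parse, so no error signals the problem; the value of "chave" is silently altered.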

As already said, the ideal is to correct this at the source: whoever generates the JSON must ensure that it is valid, and whoever reads it should not have to worry about tidying up these things.

  • Good! I would just rather keep the characters between the brackets: re.sub(r'\]([^,]+)\[', r'],\1 [', invalido)

2

Here is a solution that does not modify the original layout of the JSON (in case you want to save it back to a file with the same appearance, for example):

string = """
{
    "array": [{
        "id": "123",
        "anotherObject": {
            "name": "something"
        },
        "address": {
            "street": "Dreamland",
            "anotherArray": [
                [3, 3]
                [3, 3]
            ]
        }
    }]
}
"""

import re
import json

pattern = re.compile(r'\][^,]+\[')
for match in re.findall(pattern, string):
    # Add a comma after the ] and keep the rest of the match as it is
    sub = match.replace(']', '],')
    string = string.replace(match, sub)

json.loads(string)  # no error is raised

The regex looks for a ] character, followed by any characters other than a comma, up to the next [ character.

Then, in the loop, ] is replaced with ], in each match found in your file.

If you print the string after the loop, you will see that the spacing and line structure remain the same, with only the substitution applied.

I tested only with your example, but I believe the structure should be preserved in other cases too, except when there is more than one ] before the [ (though in that case the file has even bigger formatting issues...)
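
The caveat about more than one ] before the [ can be reproduced with a small, made-up input. The sketch below runs the loop from this answer on it and then, as an assumption that goes beyond this answer, tries the narrower pattern \](\s*)\[ that accepts only whitespace between the brackets (the same pattern appears in another answer on this page).

```python
import json
import re

# Hypothetical input: the comma between the two inner lists is missing
nested = '[[[1]] [[2]]]'

# The loop from the answer: every ] inside the match receives a comma
pattern = re.compile(r'\][^,]+\[')
looped = nested
for match in pattern.findall(looped):
    looped = looped.replace(match, match.replace(']', '],'))
print(looped)   # [[[1],], [[2]]]  (not valid JSON)

# Narrower alternative: only a ] followed by optional whitespace and a [
narrow = re.sub(r'\](\s*)\[', r'],\1[', nested)
print(narrow)   # [[[1]], [[2]]]
print(json.loads(narrow))
```

The narrower pattern also keeps the original whitespace, but it shares the limitation that it cannot tell brackets inside strings from structural ones.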

1

As stated:

  • 1: try to correct it at the source
  • 2: try to fix it using try with json.loads

Using strictly regular expressions, I would first try to mark ONLY the occurrences of "] spaces [", keeping the formatting and not removing anything:

j = re.sub(r'\](\s*)\[',  r']FIXME\1[', j)

then check whether there are any surprises... and, if no unexpected occurrences turn up, run the substitution on the original text with "," in place of FIXME:

j = re.sub(r'\](\s*)\[',  r'],\1[', j)

Although it is not the ideal way, in many cases this type of solution is useful.

As mentioned, there are contexts where the comma should not be inserted. It is easy to handle the "[ ], [ ]" string case, but there are more cases...
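
For the string case, one possible approach is to let the regex also match complete JSON string literals and return them untouched, so that a ] [ inside a string never reaches the comma rule. This is only a sketch under that assumption: the names TOKEN, add_comma and insert_commas_outside_strings are made up for illustration, and the other problematic contexts remain unhandled.

```python
import json
import re

# Either a whole JSON string literal (group 1) or "]", whitespace, "[" (group 2)
TOKEN = re.compile(r'("(?:\\.|[^"\\])*")|\](\s*)\[')

def add_comma(match):
    if match.group(1) is not None:
        return match.group(1)             # string literal: return it unchanged
    return '],' + match.group(2) + '['    # structural "] [": insert the comma

def insert_commas_outside_strings(text):
    return TOKEN.sub(add_comma, text)

# Hypothetical input: "] [" appears both inside a string and between two lists
broken = '{"chave": "[valor1]  [valor2]", "lista": [[3, 3]\n [3, 3]]}'
fixed = insert_commas_outside_strings(broken)
print(fixed)
print(json.loads(fixed))
```

Because the string alternative is tried first at every position, the scan consumes each string as a whole before it can look for brackets inside it.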
