Regular expression to return text between keys

Asked

Viewed 2,117 times

4

I have this String:

str = "Eu sou uma string {dentro de uma string {dentro de outra} }"

What regular expression can I use to get just:

dentro de uma string {dentro de outra} 
  • 3

    A hint not related to the question: avoid using variables with names that are the same as Python types. In the case "str" is the same as in the class str python. It may even seem that "it’s a loose example, it doesn’t bother anyone" - but just in this case, for example, if someone is copying and pasting the above statement to test on the Python console, it goes over-write the original "str".

3 answers

3

Although it’s probably possible to do this with regular expressions, maybe it’s not the best way to do it - Regular expressions are a language part of the "host" language - and are actually "hostile" to whoever is checking the code. They’re very useful in some cases, and as you get used to them, or exercise with them, they become worthwhile in scenarios with increasing complexity - since, you being familiar, you will be able to write the regular expression faster than you would write equivalent code in the language itself (in this case, Python).

In this case, the regular expression for this would have to rely on match groups, count groups backwards, optionally - and, in fact, it might not be possible to write some mechanism equivalent to "recursion" to ignore a variable number of }, if there are several nested. For a fixed maximum number of subkeys, I believe it is possible yes - and maybe I’ll make a try soon.

However, what I’ve written so far is to say that sometimes, programming in the language itself, some tasks are orders of magnitude simpler than using regular expressions.

And that seems to be the case here. After all, all you need is to keep a variable that will count how many keys are open, while the string is traversed character by character - when the counter reaches "0", you have a chunk between keys at the highest level. In Python are few lines:

def find_bracket_groups(text):
    open_brackets = 0
    groups = []
    current_group = ""
    for char in text:
        if char == "{":
            open_brackets += 1
        if open_brackets > 0:
            current_group += char
        if char == "}":
            open_brackets -= 1
            if open_brackets == 0:
                groups.append(current_group)
                current_group = ""
            if open_brackets < 0:
                open_brackets = 0
    return groups

Although it looks like a big function, reading is trivial - and a bit of size is some care with special cases, for example, treat "}" characters that may be loose in the text.

Also note that if necessary it would be easy to change this function to receive more parameters, and be able to locate "deeper" groups instead of just the first, or find groups with other characters other than "{}". (In contrast, parameterizing a regular expression would require the regular expression to be written as a string that would be formatted with the method ". format" before it is used as a regular expression - making the reading code even more complicated).

another way

If you always have a single group of "{...} " in the text root, as in the example - that is, nothing of the type "Aaaaa { occurrence 1 {bbbb}} cccc { occurrence2}", it is possible to have a trivial Python code, which only "clippes" the string in the first character "{" and the last character "}" - with the methods .split and rsplit of the string objects:

text = "Eu sou uma string {dentro de uma string {dentro de outra} }"
result = text.split("{", 1)[-1].rsplit("}", 1)[0]

The expression on the last line says "cut the string into substrings separated by the "}" character, starting from the left, only once. Take the last part of this division (discarding the first), and cut in the "}" character from the right one time, and take the first part of that division.

And note that this expression does not verify whether it actually exists whichever key in the string, so it would have to be supplemented with a if "{" in text: ... at some point for your code to be robust.

update

As I commented above - I tried to create a regular expression using the pattern (?(<grupo>)\}) to match an internal "}" only if another internal "{" has been seen (in the documentation look for "yes-Pattern|no-Pattern"), but it’s really not trivial - and it’s really hard to keep track of (didn’t work out). For an arbitrary number of nested keys, only creating a regexp "Frankenstein" with a maximum number of groups of {, each with the respective group "yes-Pattern" for the } corresponding - would not be practical.

  • if I want to, for example, identify how the keys are related. For example I have the expression "a{ b{ c } d{ e } }", if I just want the keys, I’ll have: {{}{}}, but how can I identify which key opens and which closes a certain expression?

  • 1

    the code above is a program - you have the knife, the cheese, in hand, and you can create as many variables as you want to hold the position of whatever you want: the sauces, mayonnaise, mustard, and side dishes. And no, it doesn’t just go back to the key structure - it comes back to what’s inside the outside group in a key structure. In this case, the function does not return the position of each key group, but just use the function enumerate in the for to have the position - with one more if inserted after the second line inside the for, can note the position of each group if you want.

2

TL;DR

For this specific problem, you do not need regular expression, you can use find to build a Slice, see:

print(str[str.find("{")+1:str.find(" }")])
dentro de uma string {dentro de outra}

But if you make a point of using regular expressions, do:

print(re.compile( "{(.*?) }").search(str).group(1))
dentro de uma string {dentro de outra}

See working on repl it.

Edited:
Given the comments of @jsbueno, I will reinforce what I put at the beginning (before any issues) of the answer: "For this specific problem", that is, the solutions presented here only serve strings that have the same format as the string originally presented in the question: ("I am a string {within a string {within another} }") (In particular, they depend on the blank space before the last key - jsbueno). I also take this opportunity to corroborate the jsbueno’s comment on the question, regarding the use of reserved words to name variables.

  • 1

    this regexp is not the answer and you know it - it is working with the example why it has a space before the "}" outside. The regexp engine does not offer control for the "match" of relatives/brackets/keys - without this space it will match the internal key

  • (Now a hint, other than the error: you don’t need to use "re.Compile" unless you’re going to manually cache compiled regular expressions - you can directly use re.find(REGEXP, target) )

  • 1

    The first way, with "find" will also not work as it is if you remove " before the key - but combining "find" with ". rfind" yes, it is possible to find the group more from outside, if each string has only one most external group between keys.

  • @jsbueno See what I made a point of putting in the answer (no edit) "to this specific problem". :-)

  • 1

    good - you know you’re wrong. the doubt is to get the nested keys - and anyone can see this, not to find a way that the outside keys NESSA string are distinct from the inside keys. Anyway, the mechanisms of the platform are there: you answer as you wish, I vote as you wish.

  • 1

    Now this incorrect answer can confuse - not only the author of the question, but other people in the future with similar problems. Then, only the downvote can no longer take care of. It is up to you to maintain.

  • Okay, @jsbueno no stress. The answer creates a complete and verifiable example with the string placed in the question, sorry, but I’ll keep it. Prospero 19, sincerely.

  • 1

    If you just write in the text that these solutions depend on space I already feel much more comfortable! :-)

  • @jsbueno edited the answer and added the remark, see if it was good.

  • In the end I think it was much more useful to keep the answer than to remove it, even for the comments that can arouse curiosities and will to learn. :-)

  • 2

    I only made sure of mencioanar that it depended on space -- I ended up putting in your note. I hope that is ok.

Show 6 more comments

1


It is even possible to make a regex that takes the content you need, but there are a few points.

The first is what has already been mentioned in the other answers: regex is not the best tool to solve this problem. It is unclear whether your strings will have only two nested key pairs, or whether the amount can vary and have no limit. Therefore, the verification made in the reply from @jsbueno is the most suitable solution.

The other points are mentioned below. Anyway, I leave here - more as a curiosity - a solution with regex.


Recursive regex

One way to capture content between key pairs, which may have other nested pairs, is by using a recursive regular expression. In your case, it could be something like:

\{((?:[^{}]|(?R))*)\}

\{ and \} are the characters { and } proper. As these characters have a special meaning in regex (they serve to define quantifiers), they need to be escaped with \ so that they "lose their powers" and are treated only as "normal" characters".

Then we have a parenthesis, which serves to define a catch group, which makes it possible to obtain its contents later. Thus, the expression \{(....)\} means that it will be possible to obtain all the contents between the keys later.

Right away we have (?:, defining a catch group. This indicates that these parentheses are only being used to group a sub-expression, and prevents the engine de regex create a capture group for nothing. Within this capture group, we have a alternation (defined by |), which means this group can have any one of these two things:

  1. [^{}] - any character other than { nor }, or
  2. (?R) - the integer regular expression itself, recursively

The first case ([^{}]) is a character class denied. The beginning ([^) indicates that I don’t want any of the characters that appear next. How I put {}, means I don’t even want the {, nor the } - note that inside brackets these characters do not need to be escaped with \.

The second case ((?R)) is the syntax for recursive regex. It basically means "take the whole regex and put it here instead of (?R)". That is, regex can "be inside itself", which is a way of checking key pairs within other key pairs.

Next we have the quantifier *, which means "zero or more occurrences" than it is before (i.e., the non-register group). That means we can have several characters that are not { nor }, or the same regex (which is {, followed by several characters other than { nor }, or the same regex etc...), all this repeated several times, and each of these occurrences has a } at the end:

explicação da regex recursiva

Thanks to recursion, regex can have an unlimited number of nested key pairs - see here an example of her working.

Unfortunately, the module re - which is the native Python module for working with regular expressions - does not support recursive regex. Therefore this code does not work:

import re

r = re.compile(r'\{((?:[^{}]|(?R))*)\}')

When trying to create recursive regex, an error occurs:

re.error: Unknown Extension ? R at position 13


So how do I do?

Fortunately, in Pypi there is the module regex, that extends the functionality of the module re and has recursive regex support:

import regex

r = regex.compile(r'\{((?:[^{}]|(?R))*)\}')
print(r.findall("Eu sou uma string {dentro de uma string {dentro de outra} }"))

The module regex is compatible with the module re, then your methods behave in the same way. The method findall, for example, returns a list of all strings you gave match, but when there are capture groups, it returns a list with the groups (this is another reason to use the catch group internally: so that it is not returned by findall). So this code prints:

['dentro de uma string {dentro de outra} ']

Testing the same regex for other strings:

print(r.findall('String {com {vários {níveis {de } chaves}} aninhadas}.'))
print(r.findall('Apenas {um par de} chaves'))
print(r.findall('String {com mais {de} um} trecho {com {chaves}}'))

Exit:

['com {vários {níveis {de } chaves}} aninhadas']
['um par de']
['com mais {de} um', 'com {chaves}']

Note that the latter has more than one possible snippet to capture, so the returned list has two strings.


There are two more couplets in this solution.

The first is that this regex also captures {}. That is, this code:

print(r.findall('Chaves {} vazias'))

Returns a list with an empty string:

['']

This happens because I used *, which means "zero or more occurrences". If I want there to be at least one character inside the keys, just switch to +, which means "one or more occurrences" (see here the difference):

import regex

r = regex.compile(r'\{((?:[^{}]|(?R))+)\}')
                                     ^
                usando + no lugar de *

Thereby, findall ignores excerpts that have no character between the keys (ignoring occurrences of {}), returning an empty list:

print(r.findall('Chaves {} vazias'))

Exit:

[]

Notice that now the exit is a empty list (no group was found) instead of a list with an element (a group has been found, which is an empty string). For the other strings above, the return remains the same.


The other though is that this regex does not check if the keys are balanced. For example, for the string String com {chaves {desbalanceadas}, the captured portion will be desbalanceadas, for it is the stretch that is within a balanced key pair (the stretch {chaves possesses {, but does not possess the } and so is out). Example:

print(r.findall('String com {chaves {desbalanceadas}'))
print(r.findall('outro exemplo com {chaves} desbalanceadas}'))
print(r.findall('Somente abre { chaves.'))

Exit:

['desbalanceadas']
['chaves']
[]

Note that in the second string the captured chunk was chaves, because it is what is inside a balanced pair of keys (the desbalanceadas} only has the closing key, and so was left out). Already the third string does not have }, then no snippet is captured (so the return is an empty list).

Maybe it’s even possible to get chaves {desbalanceadas in the first case and chaves} desbalanceadas in the second, but then the regex starts to get too complicated. And anyway, we have seen in the other answers that the ideal solution does not use regex and is in fact simpler, not only to do and understand, but also to modify (for example, to obtain the positions of each { and its respective }, etc). Regex is a powerful and very cool tool (I particularly like it a lot), but is not always the best solution.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.