How do I get a value wrapped in many braces, ignoring all the braces except the innermost pair?

1

I’m trying to get the value that’s inside the braces. Example:

{{{{{{{{{{{{{{{{{{{{Valor aqui dentro}}}}}}}}}}}}}}}}}}}}

I’m doing the tests on Regex101. I wanted the end result to be like this:

{Valor aqui dentro}

I tried this:

\{.*\}

But it matches the entire text.

  • What’s wrong with this question?

  • Which language?

  • It’s not for a language, it’s for my TCC (final course project)

  • I’m not using a language

  • OK. I ask because how you apply the regex to extract the desired string depends on the language used. I could answer if it were Python, but I believe that in JS, for example, the syntax is different

  • Okay, it’s just that I’m doing the tests on Regex101; that’s where I’m testing


3 answers

5


First I think it’s worth explaining why your regex didn’t work.

By default, quantifiers (such as * and +) are "greedy" and try to match as many characters as possible.

In your case, the regex is \{.*\}, that is: the character {, followed by zero or more characters (.*), followed by }. The dot matches any character (except line breaks). And "any character" really means any, including { and } (if the engine needs them to satisfy the expression, the dot will match these characters too).

So the criteria of your regex are: a {, followed by any number of characters (any at all, including other { and }), followed by a }. And what is the longest string that meets these criteria? The entire string. The \{ in the regex matches the first {, the .* matches the other nineteen {, plus the text, plus nineteen }, and finally the \} matches the last } (see here).
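A minimal JavaScript sketch of this greedy behavior, assuming the string from the question has 20 braces on each side (as described above). The variable names are only illustrative:

```javascript
// Build the example string: 20 opening braces, the text, 20 closing braces.
const input = '{'.repeat(20) + 'Valor aqui dentro' + '}'.repeat(20);

// Greedy .* expands as far as it can, so the match spans the whole string.
const greedyMatch = input.match(/\{.*\}/)[0];

console.log(greedyMatch === input); // true
```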


One way to fix this has already been given in another answer: use a "lazy" quantifier, i.e., replace .* with .*?. This makes the regex match as few characters as possible, so it no longer takes more { and } than it should (to learn more about lazy quantifiers, read here and here).
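One detail worth noting: the lazy quantifier only changes where the match ends, not where it starts. That is why the other answer's regex also consumes the outer braces around the group. A small sketch of the difference in JavaScript (names are illustrative, and the 20-brace string is assumed from the question):

```javascript
const input = '{'.repeat(20) + 'Valor aqui dentro' + '}'.repeat(20);

// Lazy .*? stops at the first }, but the match still starts at the first {.
const lazyAlone = input.match(/\{.*?\}/)[0];
console.log(lazyAlone === '{'.repeat(20) + 'Valor aqui dentro}'); // true

// Consuming the outer braces separately leaves only the innermost pair in group 1.
const lazyAnchored = input.match(/^\{+(\{.*?\})\}+$/)[1];
console.log(lazyAnchored); // {Valor aqui dentro}
```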

But there are other ways to do it, too. The whole problem started because you used the dot, which matches any character (including { and } themselves). But you don’t want "any character"; you want "any character that is not { or }". So you can use this regex:

^\{+(\{[^{}]+\})\}+$

Instead of the dot, I used [^{}], which is a negated character class. It matches any character that is not listed between [^ and ] (in this case, any character that is neither { nor }). Another detail is that I changed the * to +. That’s because * means "zero or more occurrences", so if the text is {{{}}}, the regex finds {}. The + means "one or more occurrences", so by changing * to +, I guarantee there must be at least one character between the braces.
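To illustrate the difference between + and * here, a short JavaScript sketch (the shorter input strings are made up for the example):

```javascript
const withPlus = /^\{+(\{[^{}]+\})\}+$/;
const withStar = /^\{+(\{[^{}]*\})\}+$/;

const input = '{{{{Valor aqui dentro}}}}';
console.log(input.match(withPlus)[1]); // {Valor aqui dentro}

// With nothing between the innermost braces, only the * version matches.
console.log(withPlus.test('{{{}}}')); // false
console.log('{{{}}}'.match(withStar)[1]); // {}
```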

With that I no longer need the lazy quantifier, because there is no longer any risk of matching too many characters: the negated character class ensures that the regex stops as soon as it finds the first { or }, which did not happen with the dot. This brings a small advantage, because the regex gets faster: compare here and here and see that the number of steps drops by more than half (obviously, for a few small strings the difference in performance is irrelevant, and I admit that in most cases it is just micro-optimization, but depending on the size and nature of the texts and the regex used, the indiscriminate use of .* can lead to catastrophic results). In any case, there is another important factor, which is making your intention clearer: when you use .*, you suggest that anything is acceptable in that part, whereas by being more specific with [^{}] you make it very clear that not just anything can appear there.

Finally, I used the anchors ^ and $, which mark the beginning and end of the string, respectively. This guarantees that the string contains only what is specified in the regex, not one character more, not one less.
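A quick sketch of what the anchors change, in JavaScript. The test string with surrounding text is an assumption made up for the example:

```javascript
const anchored = /^\{+(\{[^{}]+\})\}+$/;
const unanchored = /\{+(\{[^{}]+\})\}+/;

// Hypothetical input with extra text around the braces.
const noisy = 'abc {{x}} def';

console.log(anchored.test(noisy)); // false
console.log(noisy.match(unanchored)[1]); // {x}
console.log(anchored.test('{{x}}')); // true
```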


Since we’re talking about being more specific, you can keep changing the regex until it matches exactly what you need. Using [^{}] is a little better than ., because it restricts the set of possible characters. But [^{}] still accepts many things you may not want, such as special characters, line breaks, emojis, etc. If you want to restrict it further, you could use other options. Examples:

  • [a-zA-Z ] - accepts letters from A to Z (upper and lower case) and spaces (note that there is a space before the ]). But it doesn’t accept accented letters, so...
  • [\w ] - the shortcut \w matches letters, digits and the character _. And if you enable the Unicode flag (in regex101, click on the flags at the right side of the regex and choose the option u), it also matches accented characters.
  • but if you don’t want digits and the character _, you can use [\p{L} ]: the shortcut \p{L} matches all letters defined by Unicode (including other alphabets such as Greek, Japanese, Arabic, Cyrillic, etc)

Anyway, there are many possibilities, and everything depends on what you need. Does your text only have letters of the Latin alphabet, with no accents? Will it also have digits, punctuation marks, etc? Or do you just want anything that is between the braces? Depending on the tool you use, some regexes may not work (JavaScript, for example, does not support the shortcut \p{L} in all browsers, but you could use [^\W\d_], which works similarly).
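As a rough illustration of two of these options in JavaScript, assuming an engine that supports Unicode property escapes (\p{L} requires the u flag there). The sample strings are made up for the example:

```javascript
// Same structure as before, with different classes for the inner text.
const asciiLetters = /^\{+(\{[a-zA-Z ]+\})\}+$/;
// \p{L} needs the u flag in JavaScript.
const unicodeLetters = /^\{+(\{[\p{L} ]+\})\}+$/u;

const plain = '{{Valor aqui dentro}}';
const accented = '{{A\u00e7\u00e3o}}'; // "Ação", written with escapes

console.log(asciiLetters.test(plain)); // true
console.log(asciiLetters.test(accented)); // false
console.log(unicodeLetters.test(accented)); // true
console.log(unicodeLetters.test('{{abc123}}')); // false
```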


Balanced braces

It was not clear whether the expression should be balanced (i.e., whether each { has a corresponding }). Anyway, below are some options with regex (although it may not be the best tool to check this kind of thing; in fact I think it’s not even the best solution to the original problem of getting the text between the braces, since a simple loop over the string would already solve it).

First, the simplest case. If the number of braces is always the same, you can just do something like:

^\{{19}(\{[^{}]+\})\}{19}$

Here, \{{19} means "exactly 19 occurrences of the character {", and I did the same for \}. This guarantees that the number of { and } is the same (see the regex working here).
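A small JavaScript sketch of the fixed-count version, again assuming the 20-brace string from the question (the unbalanced variant is made up for contrast):

```javascript
const fixedCount = /^\{{19}(\{[^{}]+\})\}{19}$/;

const balanced = '{'.repeat(20) + 'Valor aqui dentro' + '}'.repeat(20);
// One closing brace fewer than opening braces.
const unbalanced = '{'.repeat(20) + 'Valor aqui dentro' + '}'.repeat(19);

console.log(balanced.match(fixedCount)[1]); // {Valor aqui dentro}
console.log(fixedCount.test(unbalanced)); // false
```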

Now, if the number of braces can vary, then it’s more complicated. The ideal solution is not to use regex, and instead use a programming language to implement an algorithm similar to this.
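The linked algorithm is not reproduced here. As a rough idea of what such a loop could look like in JavaScript, here is a minimal sketch. The function name innermostBraces is hypothetical, and the sketch only checks that the braces are balanced and that there is a single innermost pair; it does not validate the inner text the way the regexes above do:

```javascript
// Hypothetical helper: returns the innermost "{...}" if the braces are
// balanced, or null otherwise.
function innermostBraces(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      // A closing brace with no matching opening brace.
      if (depth < 0) return null;
    }
  }
  // Leftover opening braces mean the string is unbalanced.
  if (depth !== 0) return null;

  // The innermost pair runs from the last { to the first }.
  const start = text.lastIndexOf('{');
  const end = text.indexOf('}');
  if (start === -1 || end < start) return null;
  return text.slice(start, end + 1);
}

console.log(innermostBraces('{{{Valor aqui dentro}}}')); // {Valor aqui dentro}
console.log(innermostBraces('{{{x}}')); // null
```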

But just as a curiosity, it is possible to verify this with recursive regex:

^(?=(\{([^{}]+|(?1))\})$)\{+(\{[^{}]+\})\}+$

The key is the part (\{([^{}]+|(?1))\}). First, the expression is in parentheses, forming a capturing group, and since it’s the first pair of parentheses, it’s group 1.

Then we have an alternation (the character | means "or"), with two possibilities:

  • [^{}]+ - one or more characters other than braces, or
  • (?1) - which is "the same expression that corresponds to group 1" (the same regex is called here, recursively)

That is, this part can be interpreted as:

  • the character {, followed by:
    • one or more characters other than braces, or
    • the character {, followed by:
      • one or more characters other than braces, or
      • the character {, followed by:
        • ...
      • the character }
    • the character }
  • the character }

With this, the regex checks whether the braces are balanced. In addition, this whole part is inside a lookahead (between (?= and )), which checks whether something exists ahead without making it part of the match (I did it this way because with the recursive regex it was not possible to capture the innermost chunk between braces directly). Right after the lookahead we have the regex we’ve seen above, to get the desired part.

See this regex working here, and notice that it only matches cases where the braces are balanced. The difference from the previous solutions is that the part you want is now in group 3, and no longer in group 1.

But as I said, I see this more as a curiosity than a practical solution. First because it’s too complicated a solution for something that can be solved with a simple algorithm, and second because not all languages support recursive regex (which can be considered a positive point, because then you don’t even consider using it). Regex is nice and I particularly like it a lot, but it is not always the best solution.

  • 1

    Wow, what a show! I even learned how the /u flag works. As always, another excellent answer. :-)

  • 1

    @Luizfelipe Just a reminder that what I said about the u flag applies to regex101. Each language may behave differently. For example, in Python 3 Unicode mode is the default and you don’t even need a flag, so \w already matches all letters (and digits) defined by Unicode; in JavaScript the u flag does not change the behavior of \w (it is always equal to [a-zA-Z0-9_]; I don’t remember if it depends on the browser); in other languages it can work the same as, or similarly to, regex101, and so on...

  • 1

    @Luizfelipe The ideone.com link I put up doesn’t show all the characters, I think it’s because the time limit is exceeded. But here you can see them all: https://repl.it/LightheartedOverdueDictionaries

  • About the comment above: anonymous links no longer work on repl.it so I made another version on: https://repl.it/@hkotsubo/Lightheartedoverduedictionaries#main.py

4

You can use the following regular expression:

/^{+({.*?})}+$/

Basically, it matches a string as follows:

  • ^{+ Starts with one or more { characters;
  • ( Opens a capturing group:
    • { Matches one more { character (after all those matched previously);
    • .*? Matches any characters until a } is found (since we are using the lazy quantifier, ?);
    • } Matches a } character;
    • ) Closes the capturing group.
  • }+$ Ends with one or more } characters, up to the end of the string.

You can see it running here.


Note that I used the regular expression syntax supported by JavaScript. So we can create a snippet:

const string = '{{{{{{{{{{{{{{{{{{{{Valor aqui dentro}}}}}}}}}}}}}}}}}}}}';

// Group 1 captures the innermost pair of braces and its content.
const [, group] = string.match(/^{+({.*?})}+$/);

console.log(group); // {Valor aqui dentro}


Finally, it is worth making clear that this regular expression does not check whether the number of opening braces ({) is equal to the number of closing braces (}).

  • 2

    It is worth noting that the regex in question does not take into account whether the { } are balanced (which does not seem to be a problem here; I’m just pointing it out, and in fact I upvoted)

  • In fact, I’ll edit the answer by adding this remark...

  • @Bacco, just out of curiosity, do you know if it’s possible to check the number of characters like that? I can’t think of any way to make it verify that balance...

  • 2

    If it were the same character, I could use a backreference. With different characters I don’t see an immediate practical solution. In fact, if balancing were a real requirement, I don’t think regex would be the right tool (and I’m not suggesting that it is the best option in this case either).

  • @Bacco and Luiz, it is possible to check the balance with regex, using recursive regex (I put an option at the end of my answer). But in my opinion, it’s very far from being the best solution...

0

A quick fix could be:

({[^{}]+})

It captures everything between { and }, as long as the content contains no { or }.

I grouped the result in case you need to reuse the captured data.

https://regex101.com/r/DRA2bD/2/
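A short JavaScript sketch of how this unanchored pattern behaves (braces are escaped here for clarity, and the second input is made up to show that balance is not checked):

```javascript
const input = '{'.repeat(20) + 'Valor aqui dentro' + '}'.repeat(20);

// Braces are escaped here for clarity; the pattern is otherwise the same.
const match = input.match(/(\{[^{}]+\})/);
console.log(match[1]); // {Valor aqui dentro}

// No anchors, so an unbalanced string still produces a match.
console.log('{{{x}'.match(/(\{[^{}]+\})/)[1]); // {x}
```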
