Regex to capture a part that may or may not occur

Question

Asked 5 years, 7 months ago

I have a phrase that is delimited by {{ at the start and }} at the end.

Example: {{UMA FRASE DE EXEMPLO}}

In some cases the phrase may have a parameter. To indicate that it has a parameter I use _, so I would have: {{UMA FRASE DE EXEMPLO}}_(123)

Currently, this is how I capture the phrase between {{ and }}:

var command = "{{UMA FRASE AQUI}} + {{OUTRA FRASE AQUI}}_(123)";
Regex r = new Regex(@"\{\{[^\}]+?\}\}");
var m = r.Matches(command);

m contains 2 matches:

m[0] = {{UMA FRASE AQUI}}

m[1] = {{OUTRA FRASE AQUI}}

When the _ parameter is present, I need the match to include it, like this:

m[1] = {{OUTRA FRASE AQUI}}_(123)

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-12-13T14:53:18+00:00

Just make the part that starts with _ optional:

Regex r = new Regex(@"\{\{[^\}]+?\}\}(_\(\d+\))?");

I’m assuming the ID is numeric, so I used \d+ (one or more digits). The literal parentheses around the number must be escaped with \. Everything is then wrapped in another pair of parentheses to form a group, and the ? after that group makes the whole (_\(\d+\)) part optional.

If the ID can have letters and numbers, an alternative is to replace \d with \w:
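As a minimal sketch of how this behaves with the command string from the question, the code below collects each match together with its optional part. The class OptionalSuffixDemo and the helper FindPhrases are hypothetical names introduced only for this illustration; the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionalSuffixDemo
{
    // Hypothetical helper: returns each full match plus its optional parameter part (null when absent).
    public static List<(string Match, string Param)> FindPhrases(string command)
    {
        var r = new Regex(@"\{\{[^\}]+?\}\}(_\(\d+\))?");
        var result = new List<(string Match, string Param)>();
        foreach (Match m in r.Matches(command))
        {
            // Groups[1] is the optional group; Success is false when it did not take part in the match.
            Group g = m.Groups[1];
            result.Add((m.Value, g.Success ? g.Value : null));
        }
        return result;
    }

    public static void Main()
    {
        var command = "{{UMA FRASE AQUI}} + {{OUTRA FRASE AQUI}}_(123)";
        foreach (var item in FindPhrases(command))
        {
            var shown = item.Param ?? "(none)";
            Console.WriteLine(item.Match + " | parameter: " + shown);
        }
    }
}
```

With that input, the first match is {{UMA FRASE AQUI}} with no parameter part, and the second is {{OUTRA FRASE AQUI}}_(123) with the parameter part _(123).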

Regex r = new Regex(@"\{\{[^\}]+?\}\}(_\(\w+\))?");

However, the shorthand \w also matches the character _, so parameters such as ___ and __1__ will be considered valid. If you do not want to accept _, you can change it to:

Regex r = new Regex(@"\{\{[^\}]+?\}\}(_\([a-zA-Z\d]+\))?");

The character class [a-zA-Z\d] matches the letters a to z (upper and lower case), plus digits (\d). But this regex does not match accented letters; if you need those, you could use:

Regex r = new Regex(@"\{\{[^\}]+?\}\}(_\([\p{L}\d]+\))?");

The shorthand \p{L} matches every character that Unicode places in one of the "Letter" categories (all general categories whose abbreviation begins with "L"). So, in addition to accented letters, it also matches letters from other scripts (Arabic, Japanese, Cyrillic, etc.).
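To compare the four parameter classes side by side, here is a small sketch. The sample inputs, the class ParameterClassDemo, and the helper Accepts are assumptions made for this illustration, and the pattern is anchored with ^ and $ (unlike the regexes above) so that a match only counts when the parameter itself is accepted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParameterClassDemo
{
    // Hypothetical helper: true when the whole input is a phrase followed by a parameter
    // made only of characters from the given class.
    public static bool Accepts(string charClass, string input)
    {
        var r = new Regex(@"^\{\{[^\}]+?\}\}_\(" + charClass + @"+\)$");
        return r.IsMatch(input);
    }

    public static void Main()
    {
        // Sample inputs chosen for illustration; the last one is "ação", written with escapes.
        string[] inputs = { "{{A}}_(123)", "{{A}}_(abc1)", "{{A}}_(__1__)", "{{A}}_(a\u00E7\u00E3o)" };
        // The four parameter classes discussed above.
        string[] classes = { @"\d", @"\w", @"[a-zA-Z\d]", @"[\p{L}\d]" };

        foreach (var cls in classes)
        {
            foreach (var input in inputs)
            {
                Console.WriteLine(cls + "  " + input + "  " + Accepts(cls, input));
            }
        }
    }
}
```

Under these assumptions, only \w accepts __1__, and only \w and [\p{L}\d] accept the accented value, since \w in .NET also covers Unicode letters.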

Anyway, there are several options, and which one to use depends a lot on what your data looks like. If you know, for example, that there are no cases like __1__ and all IDs are valid, using just \w (or [a-zA-Z\d], if there are no accented letters) may be enough.

Another detail: in the [^\}]+? part you do not need the question mark. There it makes the + quantifier "lazy", but since you are matching [^\}] (any character that is not }) and the next thing in the pattern is the } character itself, there is no risk of the regex consuming more than necessary (which is one of the main reasons to use lazy quantifiers).

So the ? right after the + can be removed:

Regex r = new Regex(@"\{\{[^\}]+\}\}(_\(\d+\))?");
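A quick way to check this claim is to run the lazy and the greedy version on the same input and compare the results. The sketch below does that; the class LazyVsGreedyDemo, the helper MatchAll, and the contrasting \{\{.+\}\} pattern are additions for illustration only and do not come from the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Hypothetical helper: returns the text of every match of the pattern in the input.
    public static string[] MatchAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        var command = "{{UMA FRASE AQUI}} + {{OUTRA FRASE AQUI}}_(123)";

        var lazy = MatchAll(@"\{\{[^\}]+?\}\}(_\(\d+\))?", command);
        var greedy = MatchAll(@"\{\{[^\}]+\}\}(_\(\d+\))?", command);

        // For contrast: with a dot instead of [^\}], the greedy + runs up to the last }} in the string.
        var greedyDot = MatchAll(@"\{\{.+\}\}", command);

        Console.WriteLine(lazy.SequenceEqual(greedy)); // True: both versions find the same two matches
        Console.WriteLine(greedyDot.Length);           // 1: a single match spanning both phrases
    }
}
```

The negated character class is what keeps each match inside its own pair of braces, which is why the lazy modifier adds nothing here.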