Comments in regular expressions?

6

Regex is a very complicated language and not easy to read, but is there any way to describe what each little character does? Example:

In this regex, \{.*?[^\}]+\} captures everything inside a { } field, but is there some way of saying what each character does?

I got curious because I found this regex:

                        # Character definitions:
                        '
                        (?> # disable backtracking
                          (?:
                            \\[^\r\n]|    # escaped meta char
                            [^'\r\n]      # Data character except '
                          )*
                        )
                        '?
                        |
                        # Normal string & verbatim strings definitions:
                        (?<verbatimIdentifier>@)?         # this group matches if it is a verbatim string
                        ""
                        (?> # disable backtracking
                          (?:
                            # match and consume an escaped character including escaped double quote ("") char
                            (?(verbatimIdentifier)        # if it is a verbatim string ...
                              """"|                         #   then: only match an escaped double quote ("") char
                              \\.                         #   else: match an escaped sequence
                            )
                            | # OR

                            # match Data char except double quote char ("")
                            [^""]
                          )*
                        )
                        ""

It matches a C# string, and it is documented by comments that start with the # character. Is this real? Is the whitespace removed when the expression is compiled?

  • Just one thing: does the .NET Framework Regex also support this?

  • 1

    I improved the answer.

  • 1

    Not directly related, but if you only want what's between the braces, you don't need .*? in the regex; \{[^\}]+\} is enough: https://regex101.com/r/gWEh0A/2/ :-)

2 answers

8


Yes, these are comments in C# regular expressions (or, more specifically, in .NET regular expressions, since they work in every language that uses the .NET library). Everything from the # to the end of the line is ignored. It is also possible to mark both the beginning and the end of a comment with parentheses, using the (?# ... ) syntax.

In other languages, or more precisely in other regular expression libraries, comments may be available with a slightly different syntax.

Documentation.
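As a minimal sketch of what this looks like in practice, here is the brace regex from the question (in the simplified form \{[^}]+\} suggested in the comments) written with one token per line and a # comment on each. The class and field names are made up for the illustration, and the sketch assumes RegexOptions.IgnorePatternWhitespace is passed, since the # comments are only recognized in that mode.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedRegexDemo
{
    // Hypothetical name; the pattern is the question's brace regex, one token per line.
    public static readonly Regex BraceField = new Regex(@"
        \{        # a literal opening brace
        [^}]+     # one or more characters that are not a closing brace
        \}        # a literal closing brace
        ", RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Match m = BraceField.Match("name: {first} {last}");
        Console.WriteLine(m.Value); // {first}
    }
}
```

The indentation and line breaks inside the verbatim string are ignored by the engine, so the layout can be chosen purely for readability.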

  • Thanks for the reply, sorry for the delay.

2

Complementing the other answer: according to the documentation, to use these comments you must enable the "ignore pattern white-space" mode, either with the (?x) flag (also called a mode modifier or inline option) or with the RegexOptions.IgnorePatternWhitespace option.

Examples:

string s = "abc";

// using (?x), prints "True"
Console.WriteLine(Regex.IsMatch(s, "^[a-z]{3}(?x) # começa com 3 letras"));
// using RegexOptions.IgnorePatternWhitespace, prints "True"
Console.WriteLine(Regex.IsMatch(s, "^[a-z]{3} # começa com 3 letras", RegexOptions.IgnorePatternWhitespace));
// with neither option, the text is not treated as a comment, prints "False"
Console.WriteLine(Regex.IsMatch(s, "^[a-z]{3} # começa com 3 letras"));

In the third case, since neither option ((?x) or RegexOptions.IgnorePatternWhitespace) was used, the text # começa com 3 letras (including the space before the #) is treated as part of the regex. That is, in this case the regex would only find a match if the string contained "abc # começa com 3 letras".


One detail is that this option does more than enable comments. As its name says (ignore pattern white-space), when it is enabled the regex ignores the spaces in the expression, and also tabs and line breaks. That is what makes the comments practical: a comment runs from # to the end of the line, and the regex simply continues on the next line.

And here there is a difference between (?x) and RegexOptions.IgnorePatternWhitespace: the first can appear anywhere in the expression and only affects the part that comes after it. The second always affects the whole expression. Example:
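The role of the line break can be seen in a small sketch. The helper name MatchIgnoringWhitespace is hypothetical; it only wraps Regex.Match with RegexOptions.IgnorePatternWhitespace so that the two patterns can be compared side by side.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineCommentDemo
{
    // Hypothetical helper: returns the text matched with whitespace-ignoring mode enabled.
    public static string MatchIgnoringWhitespace(string input, string pattern)
    {
        return Regex.Match(input, pattern, RegexOptions.IgnorePatternWhitespace).Value;
    }

    public static void Main()
    {
        // The line break ends the comment, so b and c are still part of the regex.
        Console.WriteLine(MatchIgnoringWhitespace("abc", "a # first letter\nb\nc")); // abc

        // On a single line everything after # is a comment, so the regex is just "a".
        Console.WriteLine(MatchIgnoringWhitespace("abc", "a # first letter b c")); // a
    }
}
```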

string s = "abc";

// using (?x), prints "False"
Console.WriteLine(Regex.IsMatch(s, "a  b(?x)   c"));
// using RegexOptions.IgnorePatternWhitespace, prints "True"
Console.WriteLine(Regex.IsMatch(s, "a  b   c", RegexOptions.IgnorePatternWhitespace));

In the first case, (?x) only ignores the spaces before the c. The spaces between a and b are still part of the regex, so it does not find a match and prints False.

In the second case, RegexOptions.IgnorePatternWhitespace causes all regex spaces to be ignored, so it finds a match and prints True.


There is also an alternative syntax that makes (?x) affect only part of the expression:

string s = "abc";
Console.WriteLine(Regex.IsMatch(s, "a(?x:   b)  c")); // False
Console.WriteLine(Regex.IsMatch(s, "a(?x)   b   c")); // True

In the first case, (?x:   b) means that all the spaces before the b are ignored. But the spaces before the c are outside the parentheses, so they are part of the regex.

In the second case, (?x) causes all the spaces that come after it to be ignored.

Another option is to use (?-x), which turns off (?x):

string s = "ab c";

Console.WriteLine(Regex.IsMatch(s, "a(?x)   b c")); // False
Console.WriteLine(Regex.IsMatch(s, "a(?x)   b(?-x) c")); // True

In the first case, all the spaces in the regex are ignored, so it is actually looking for abc. Since the string contains a space, the result is False. In the second case, the space after (?-x) is not ignored, because (?-x) turns off (?x); the space before the c is therefore part of the regex, and the result is True.


If you want the regex to contain a space even with this option enabled, use [ ] (a character class containing a space; note the space between the brackets) or the shortcut \s (keeping in mind that \s also matches TAB and line breaks).
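A short sketch of the two options, with the mode enabled for the whole expression. The class and method names are only illustrative; the patterns are the ones described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralSpaceDemo
{
    // Hypothetical helper: tests a pattern with whitespace-ignoring mode enabled.
    public static bool MatchesIgnoringWhitespace(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        // A bare space is ignored, so the regex is really "abc".
        Console.WriteLine(MatchesIgnoringWhitespace("ab c", "ab c"));     // False
        // A space inside a character class is kept.
        Console.WriteLine(MatchesIgnoringWhitespace("ab c", "ab[ ]c"));   // True
        // \s also works, but it accepts a TAB as well.
        Console.WriteLine(MatchesIgnoringWhitespace("ab c", @"ab\sc"));   // True
        Console.WriteLine(MatchesIgnoringWhitespace("ab\tc", @"ab\sc"));  // True
        Console.WriteLine(MatchesIgnoringWhitespace("ab\tc", "ab[ ]c"));  // False
    }
}
```

Choosing [ ] over \s is the stricter option when only the space character should be accepted.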


To include comments in a regex, there is also the (?# comment ) syntax. In this case you do not need to enable the ignore pattern white-space option, because everything between (?# and ) is ignored:

Console.WriteLine(Regex.IsMatch("abc", "ab(?# um comentário)c")); // True 

Keep in mind that spaces outside the (?# ... ) part still belong to the regex; to ignore them, you need to enable the ignore pattern white-space option:

string s = "abc";
Console.WriteLine(Regex.IsMatch(s, "ab(?# um comentário)  c")); // False 
Console.WriteLine(Regex.IsMatch(s, "ab(?# um comentário)(?x)  c")); // True 

This mode is also called free-spacing mode in some engines.
