Regular expression match involving special regex symbols and line breaks

2

I need help with a regular expression that matches a string which:

  • starts with the character } at the end of a line
  • keeps matching everything it finds on the next line (including other "escaped" characters like \)
  • up to another } character at the end of the following line.

I tried several combinations and couldn't get it to work.

  • Can you give an example of the text and say which programming language you are using?

  • You need to specify where the regex will be used; depending on the place, the arguments change.

2 answers

6


I will assume that the regex flavor is similar to PHP's, one of the most common.

I will also assume that the "escaped" characters are line breaks, which the "." operator does not match by default, so you can use this regex:

}(.|\n)*?}

Explanation:

  • } means the match only starts when there is a closing brace.
  • ( opens the capture group.
  • .|\n matches any character that is not a line break, or a line break (in the end, any character).
  • ) closes the capture group.
  • *? is a lazy quantifier: it stops the match at the first following closing brace }, so it does not capture unnecessary text if there is more than one occurrence of {} in the code.
  • } is the stopping condition that ends the match.

You can test it here
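
For reference, here is a minimal sketch of the same regex in Python (the answer assumes a PHP-like flavor, so Python and the sample text are my assumptions; the pattern itself is unchanged):

```python
import re

# Sample text is made up for this sketch: a } at the end of a line,
# a line containing a backslash, and another } at the end of the next line.
text = "foo }\nbar \\ baz\nqux }\nrest"

# Same pattern as above: } then lazily (any char or \n) up to the next }
pattern = re.compile(r"}(.|\n)*?}")

match = pattern.search(text)
matched = match.group(0)
print(repr(matched))
```

Because the quantifier is lazy, the match ends at the first } found after the opening one.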

  • You can make the . (dot) match \n by using the s flag, as I explained earlier here.

2

Just to complement: the other answer does not take into account an important detail of the question, namely that the } characters should be at the end of the line. That is, if we have a text like this:

abc } def
blablabla
xyz }

The regex should not consider the } on the first line, because it is not at the end of the line. But the solution proposed in the other answer matches this case as well.
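
A quick way to see this, sketched in Python (the language choice is my assumption; the text is the example above):

```python
import re

text = "abc } def\nblablabla\nxyz }"

# The regex from the other answer, with no anchoring to the end of a line
match = re.search(r"}(.|\n)*?}", text)

start = match.start()      # index of the } where the match begins
matched = match.group(0)
print(start, repr(matched))
```

The match starts at the } in the middle of the first line, which is exactly the case the question wants to exclude.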


To make the regex consider only a } at the end of a line, we can use the $ anchor. By default it means "end of string", but many languages and engines have a flag that makes it also match the end of each line. On regex101.com, for example, just enable the multiline flag (the m option in the upper right corner), so the regex becomes }$(.|\n)*?}$ (see the difference).
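
As a rough illustration of the flag, here is a Python sketch (the sample text and variable names are assumptions made for the example):

```python
import re

# The first } is mid-line; the other two are at the end of their lines.
text = "abc } def\nblablabla\nxyz }\nmore\nend }"
pattern = r"}$(.|\n)*?}$"

# Without MULTILINE, $ matches only at the end of the string
# (or just before a final newline), so nothing is found here.
no_flag = re.search(pattern, text)

# With MULTILINE, $ also matches right before each \n.
with_flag = re.search(pattern, text, re.MULTILINE)

print(no_flag)
print(repr(with_flag.group(0)))
```

With the flag enabled, the mid-line } is skipped and the match runs from the } after "xyz" to the } after "end".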

This flag is present in most languages, usually under the name multiline: for example, it exists in PHP, Python, Java, JavaScript, etc. Check the documentation of the language or tool you are using; most have this option.

Another detail is that the dot, by default, means "any character except line breaks". To make it also match line breaks, you can use the option proposed in the other answer, (.|\n), but that option can miss Windows line breaks (\r\n) in engines where the dot does not match \r either. A better alternative is to enable the DOTALL flag (also called singleline, a somewhat confusing name given that its function is to make the dot match line breaks).

On regex101.com it is called singleline, and with it enabled I can change the regex to }$.*?}$ (see). In PHP the option is s (although its name is PCRE_DOTALL), the same goes for Python and JavaScript, and in Java it is just DOTALL (although Java also accepts the (?s) syntax inside the expression, which enables the same flag).

Yes, in many languages flags can be enabled inside the regex itself. Check the documentation, but the most common syntax for this case is (?s)(?m)}$.*?}$ (the (?s) enables DOTALL mode and the (?m) enables MULTILINE mode; see here that it behaves the same).
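
To compare the two ways of enabling the flags, here is a small Python sketch (sample text and variable names are assumptions for the example):

```python
import re

text = "abc } def\nblablabla\nxyz }\nmore\nend }"

# Flags passed as arguments: MULTILINE for $, DOTALL so . matches \n
with_flags = re.search(r"}$.*?}$", text, re.MULTILINE | re.DOTALL)

# The same flags enabled inline, at the start of the expression
inline = re.search(r"(?s)(?m)}$.*?}$", text)

# With MULTILINE alone, . cannot cross the line break, so nothing matches
multiline_only = re.search(r"}$.*?}$", text, re.MULTILINE)

print(repr(with_flags.group(0)))
print(repr(inline.group(0)))
print(multiline_only)
```

Both forms should give the same match; the third search shows why DOTALL is needed once (.|\n) is replaced by a plain dot.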

Another alternative is to use }$[\s\S]*?}$. The shortcut \s matches whitespace, including line breaks, and \S is "anything that is not \s". So [\s\S] is everything that is \s plus everything that is not \s; in other words, it is a trick to match any character, including line breaks. This way you do not need the DOTALL flag (see here); it is usually used in engines that do not have that option.
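
A short Python sketch of the [\s\S] trick, using only the MULTILINE flag (the sample text is an assumption):

```python
import re

text = "abc } def\nblablabla\nxyz }\nmore\nend }"

# [\s\S] matches any character, line breaks included, so DOTALL is not needed;
# MULTILINE is still required for $ to match at the end of each line.
match = re.search(r"}$[\s\S]*?}$", text, re.MULTILINE)

matched = match.group(0)
print(repr(matched))
```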


How many lines between the }?

The regex above takes as many lines as necessary until it finds the second } at the end of a line. But from the description, I understand that you actually want exactly one line between the two }. In that case, the regex could be:

}(\r\n?|\n).*(\r\n?|\n).*}(\r\n?|\n|$)

Now I use (\r\n?|\n) to handle Windows line breaks (\r\n), classic Mac OS line breaks (only the \r, since \n? makes the \n optional), or a single \n, which is the Unix/Linux line break.

One detail is that this way I no longer need the MULTILINE and DOTALL flags. The regex now takes a } followed by a line break, then .* (zero or more characters) followed by another line break, followed by .* (zero or more characters), followed by }, followed by another line break.

This guarantees that there is only one line in between. Note that in this case the DOTALL flag must be switched off, to prevent .* from spanning more than one line. Also note the |$ at the end: besides a line break, the } can be followed by the end of the string (for cases where the } is the last character and there is no other line after it). See the regex working here.
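
Here is a Python sketch of the one-line version, with no flags at all (the three sample texts are assumptions made to exercise each case):

```python
import re

pattern = re.compile(r"}(\r\n?|\n).*(\r\n?|\n).*}(\r\n?|\n|$)")

one_line = "foo }\nbar\nbaz }\nrest"      # exactly one line in between
two_lines = "foo }\nbar\nextra\nbaz }"    # two lines in between
windows = "foo }\r\nbar\r\nbaz }"         # \r\n line breaks, } ends the string

m_one = pattern.search(one_line)
m_two = pattern.search(two_lines)
m_win = pattern.search(windows)

print(repr(m_one.group(0)))
print(m_two)
print(repr(m_win.group(0)))
```

Note that when the closing } is followed by a line break, that line break is part of the match; strip it afterwards if you do not want it.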


It can be somewhat tedious to repeat the same line-break expression several times. For this there is the subroutine feature, which some engines support (check your language or tool documentation to see whether you can use it):

}(\r\n?|\n).*(?1).*}((?1)|$)

In this case the parentheses form a capture group, and we can refer to it later with (?1), which basically means "use here the same expression that was used in the first capture group". See the regex working here.

Some engines also support named groups, which can help make the regex a little more "readable" (or not; it's a matter of opinion):

}(?<linebreak>\r\n?|\n).*(?&linebreak).*}((?&linebreak)|$)

(?<linebreak> defines the group named linebreak, and (?&linebreak) means "use the expression of the linebreak group here". See it working here.
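
Not every engine has subroutines; Python's built-in re module, for example, does not accept (?1) or (?&name). In such engines a similar effect can be sketched by building the pattern from a string variable. This is plain string substitution, not the subroutine feature itself, and the name LINE_BREAK is just something I chose for the example:

```python
import re

# LINE_BREAK is a hypothetical helper name; it plays the role of the
# linebreak group, but is inserted into the pattern as text.
LINE_BREAK = r"(?:\r\n?|\n)"

# In an f-string, }} produces a single literal } in the resulting pattern.
pattern = re.compile(rf"}}{LINE_BREAK}.*{LINE_BREAK}.*}}(?:{LINE_BREAK}|$)")

match = pattern.search("foo }\nbar\nbaz }")
matched = match.group(0)
print(pattern.pattern)
print(repr(matched))
```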


Finally, some regex engines support the shortcut \R, which matches a line break (\r\n, \n, or \r alone, among other characters; it varies by language). So you could also use something like }\R.*\R.*}(\R|$) (see here), or }(\R).*(?1).*}((?1)|$) (see here).
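
Where \R is not available (Python's built-in re module does not have it), it can be approximated by spelling out the alternatives. The character list below is my assumption, based on how PCRE documents \R in Unicode mode; the name ANY_LINE_BREAK is hypothetical:

```python
import re

# Approximation of \R: \r\n as a unit, or any single line-break character.
ANY_LINE_BREAK = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"

pattern = re.compile(
    rf"}}{ANY_LINE_BREAK}.*{ANY_LINE_BREAK}.*}}(?:{ANY_LINE_BREAK}|$)"
)

# Mixed line breaks: \r\n after the first }, U+2028 after "bar"
match = pattern.search("foo }\r\nbar\u2028baz }")
matched = match.group(0)
print(repr(matched))
```

Keep in mind that in Python the dot only excludes \n, so this is a sketch of the idea rather than an exact equivalent of the \R versions above.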
