How to avoid Greedy Repetition (.*) to search for string that has defined start, dynamic middle and defined end?

Asked

Viewed 50 times

3

I have a log file that is searched via regex ^WebServer:.*endOfLine. Considering the log below this regex is generating 485 iterations (Steps):

String:

amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. S
amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. S
amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. S
WebServer: error 2312. Falha de conexão com o destino. StartLine Lorem ipsum dolor sit amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. Sent Request
amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. S
amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine
WebServerClosed: status 10. S

The result of the capture promoted by regex is:

WebServer: error 2312. Falha de conexão com o destino. StartLine Lorem ipsum dolor sit amet, consectetur adipiscing elit...Lorem ipsum dolor sit amet, consectetur adipiscing elit...step1 endOfLine

Considering the same final product, which approach/technique has use to generate fewer iterations and the use of .* (Greedy Repetition)?

I’m using the: https://regex101.com

  • On which site/language/tool did you test? In regex101, for example, only 109 Steps: https://regex101.com/r/qEI5Lz/1/ (and the non-Greedy version gets worse, takes more than 290: https://regex101.com/r/qEI5Lz/2/) - and depending on the language/tool/engine, the amount may vary, so it’s interesting to put which one you’re using and how you’re measuring

  • @hkotsubo yes, there in Regex Debugger function

  • Oh yeah, it’s just Debugger it counts the steps from the start, but on the main screen I think it only counts from the line, so gave less...

  • Anyway, I don’t think I can make it much faster, and use the quantifier Lazy, as response below actually slows down...

2 answers

3


To leave the quantifier no-Greedy (also called Lazy, more information here and here), simply change the regex to ^WebServer:.*?endOfLine - the ? shortly after the * makes it not-Greedy.

Only in this case the regex gets slower: your version needs 485 steps, and the version Lazy needs more than 660 steps.

As explained here, whether or not Greedy does not necessarily make regex slower or faster, as it depends on the expression and of the string being checked.

In this case I would say that the most suitable seems to be the quantifier Greedy same. As the point, by default, does not consider line breaks, then .* already advances to the end of the line at once. And then it goes back to check if it has endOfLine, but as there are few steps, it is more efficient than the Lazy (that will test all characters after the :, until I find the endOfLine, and how it has much more characters before than after the endOfLine, the version Greedy ends up finding the match in less steps).

  • Show. As for the use of regex I do not know much. I was looking for a way to get the same result with the least amount of Steps possible because the iterated data set is gigantic.

  • @Fábiojânio Then maybe regex is not the best option :-) If you use any language and a program that reads line by line, seeing if it starts with "Webserver" and ends with "endOfLine", it will probably be faster than regex (most languages have functions such as startsWith and endsWith or similar, all generally faster than a regex)

-1

adds a ? after the * so that he stops being Greedy.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.