What can make a regular expression slow?

Asked

Viewed 534 times

26

It’s happened here in the Stackoverlow I am recommended by some users to avoid the use of regex, in some cases as it may have undesirable performance.

I have also heard that, depeded in the way that a regular expression is constructed (being that, to capture something desired, there may be many ways to do it with regular expression), it may be slower.

Considering these factors:

  • Which makes regular expression slower than other ways of working with strings?

  • Depending on what I wish to do with regular expression, there may be some way to make it (with regular expression) more performative, just changing its construction?

For example, one of the expressions below can be considered "faster" than the others?

 // Apenas números. Qual é mais rápido?

'teste 123456 teste 123'.replace(/[^0-9]+/g, '')

'teste 123456 teste 123'.replace(/\D+/g, '')

'teste 123456 teste 123'.replace(/[^\d]+/g, '')
  • 2

    Related http://answall.com/questions/81426/regex-tempo

  • @Wallacemaxters, maybe this content can be useful http://answall.com/questions/5136/performance-de-replaces%C3%A7%C3%A3o-em-string/5239#5239, here he compares Regexp to other forms.

  • To test your expressions you can use http://jsperf.com/

  • I used a lot of the references on this site. http://www.regular-expressions.info and for the question I believe that here you can give a good idea of how much Regex can weigh to evaluate a simple string http://www.regular-expressions.info/catastrophic.html

  • Related: http://answall.com/q/138705/101

2 answers

18


Briefly

  • How specific she is.
  • The number of quantifiers you are using.

What is a REGEX

You should remember that, regardless of the language/library, regex are Deterministic finite automaton, which will be compiled and transformed into a state machine.

inserir a descrição da imagem aqui

So you should keep in mind that it analyzes letter by letter (symbol by symbol), until it checks whether the sentence is valid or not.

Example

Let’s say you have the following regex a(b|c). A very simple state machine is generated.

inserir a descrição da imagem aqui

  • The first state says that the letter/symbol must be a, obligatorily.
  • If you pass the 1st state the 2nd state can be b or c.

If any of the states is not long or none of the next states is found. The sentence is invalid.

  • ab, ac = Valid
  • ba = Invalido, no longer buy the first state that should be a.
  • aX = Invalidity, X not in the possibility of the 2nd state.

Quantifiers

The analogy of quantifiers is very simple.

inserir a descrição da imagem aqui inserir a descrição da imagem aqui inserir a descrição da imagem aqui inserir a descrição da imagem aqui

You generate a state that points to itself, or generates a "short" to not go through that state (*, {0,).

What they interfere with?

Here has a short summary and a good analogy (In my opinion).

A quantifier will tell you how many times that state can be validated. A poorly planned state with infinite quantifier can be catastrophic. .*/.+, Example of this here. (I learned in practice)

What happens is that it will simply continue to follow as long as it can until it reaches the end of the sentence. When it arrives at the end, it makes the reverse path, removing characters and checking if the next state (2º) is satisfied, in the first character that satisfies the next state (2º) it continues (for 3º).

Most of the time you don’t notice these problems because we work with strings small that the computer just goes back and forth so fast that it’s almost imperceptible.

Example

https://regex101.com/r/vrUk3U/1

That’s why he stopped at the last comma, not the first comma.

Specify

Remember to specify as much as you want.
A good example of specification is the beginning ^.

You should also have a mind that if you don’t specify where it should start from, then it will simply start from anywhere. Example here.

So to prevent her from unnecessary testing the simple act of being specific makes her faster.

Note

  • Some REGEX are easier to assemble using reverse logic. Instead of thinking about what you need, think about what you don’t need.

About the examples

'teste 123456 teste 123'.replace(/[^0-9]+/g, '')

'teste 123456 teste 123'.replace(/\D+/g, '')

'teste 123456 teste 123'.replace(/[^\d]+/g, '')

In terms of REGEX you will have no difference at all as the three will be compiled to [^0-9]. You may have a difference in compile time, in which your library will take different time to transcribe what you are saying.

  • 2

    Best explanation about regex I ever saw. Excellent.

12

This link contains a good explanation about, whose ideas I also explain here below, with an example I built and I went around in my machine.

Invalid entries can cause very poor performance with poorly defined regular expressions. An example of poorly defined regular expressions are those recognize entry in more than one possible way, and, consequently, try to recognize invalid entries in several possible ways.

In general, the time taken to recognize an invalid entry can increase exponentially with respect to the number of characters n of the entrance

Example

The regular expression xx+y can also be written as (x+x+)+y . The first expression translates to a deterministic finite automaton, without backtracking, while the second expression uses backtracking. When a valid input is recognized, there is no problem, as the automaton closes the search immediately. When an invalid entry is provided, however, the recognition mechanism will attempt all possible forms of recognition until it is certain that the entry cannot be recognized by Regex.

Here is an example of code and its corresponding output on my machine. The example is in C#, but the concepts are technology agnostic:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace TesteRegex
{
    class Program
    {
        static void Main(string[] args)
        {
            string padraoMau = "(x+x+)+y";
            string padraoBom = "xx+y"; ;
            string[] arrayBom = new string[] { "xxy", "xxxxxxy", "xxxxxxxxxxxxxxxxxxxxy", "xxxxxxxxxxxxxxxxxxxxxxy", "xxxxxxxxxxxxxxxxxxxxxxxxxxy" };
            string[] arrayMau = new string[] { "xxx", "xxxxxxx", "xxxxxxxxxxxxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxxxxxxx" };

        Console.WriteLine("---- Backtracking: sim; Entrada válida: sim ----");
        TestaRegex(padraoMau, arrayBom);
        Console.WriteLine("------------------------------------------------");
        Console.WriteLine("---- Backtracking: sim; Entrada válida: não ----");
        TestaRegex(padraoMau, arrayMau);
        Console.WriteLine("------------------------------------------------");

        Console.WriteLine("---- Backtracking: não; Entrada válida: sim ----");
        TestaRegex(padraoBom, arrayMau);
        Console.WriteLine("------------------------------------------------");
        Console.WriteLine("---- Backtracking: não; Entrada válida: não ----");
        TestaRegex(padraoBom, arrayMau);
        Console.WriteLine("------------------------------------------------");
        Console.ReadLine();
    }

    private static void TestaRegex(string padrao, string[] array)
    {
        Console.WriteLine("Tamanho da entrada  | Tempo gasto");
        Console.WriteLine("------------------------------------");
        Stopwatch sw = new Stopwatch();
        for (int i = 0; i < array.Length; i++)
        {
            sw.Start();
            Regex.IsMatch(array[i], padrao);
            sw.Stop();                
            Console.WriteLine("                   " + string.Format("{0,2}", array[i].Length) + "|  " + sw.Elapsed.ToString("ss\\.ffffff"));
            sw.Reset();
        }
    }
}

}

Exit:

---- Backtracking: sim; Entrada válida: sim ----
Tamanho da entrada  | Tempo gasto
------------------------------------
                    3|  00.000347
                    7|  00.000021
                   21|  00.000003
                   23|  00.000002
                   27|  00.000002
------------------------------------------------
---- Backtracking: sim; Entrada válida: não ----
Tamanho da entrada  | Tempo gasto
------------------------------------
                    3|  00.000011
                    7|  00.000039
                   21|  00.594714
                   23|  02.359978
                   27|  38.033123
------------------------------------------------
---- Backtracking: não; Entrada válida: sim ----
Tamanho da entrada  | Tempo gasto
------------------------------------
                    3|  00.000043
                    7|  00.000005
                   21|  00.000014
                   23|  00.000017
                   27|  00.000024
------------------------------------------------
---- Backtracking: não; Entrada válida: não ----
Tamanho da entrada  | Tempo gasto
------------------------------------
                    3|  00.000008
                    7|  00.000007
                   21|  00.000019
                   23|  00.000020
                   27|  00.000024
------------------------------------------------

The only case of suffering performance was that of invalid entries in a backtracking regex (38 seconds with backtracking versus less than a thousandth for the version without backtracking, for a size 27 input).

The problem is that the number of ways to try to recognize an invalid string in the version without backtracking is associated with number of partitions of an integer number , that is exponential in input size, which causes poor performance.

Solution: When defining a regular expression, make sure that there is a form (or as few forms as possible) only to recognize both valid and invalid entries. Prefer simple expressions such as xx+y instead of (x+x+)y .

In the case of the examples you provided for comparison, they will possess similar performance (good).

  • Another reason to avoid regex is that some recognition engines are simply bad. The implementation of regex for Golang for example is not very performative currently.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.