This link contains a good explanation about, whose ideas I also explain here below, with an example I built and I went around in my machine.
Invalid entries can cause very poor performance with poorly defined regular expressions. An example of poorly defined regular expressions are those recognize entry in more than one possible way, and, consequently, try to recognize invalid entries in several possible ways.
In general, the time taken to recognize an invalid entry can increase exponentially with respect to the number of characters n
of the entrance
Example
The regular expression xx+y
can also be written as (x+x+)+y
. The first expression translates to a deterministic finite automaton, without backtracking, while the second expression uses backtracking. When a valid input is recognized, there is no problem, as the automaton closes the search immediately. When an invalid entry is provided, however, the recognition mechanism will attempt all possible forms of recognition until it is certain that the entry cannot be recognized by Regex.
Here is an example of code and its corresponding output on my machine. The example is in C#, but the concepts are technology agnostic:
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace TesteRegex
{
class Program
{
static void Main(string[] args)
{
string padraoMau = "(x+x+)+y";
string padraoBom = "xx+y"; ;
string[] arrayBom = new string[] { "xxy", "xxxxxxy", "xxxxxxxxxxxxxxxxxxxxy", "xxxxxxxxxxxxxxxxxxxxxxy", "xxxxxxxxxxxxxxxxxxxxxxxxxxy" };
string[] arrayMau = new string[] { "xxx", "xxxxxxx", "xxxxxxxxxxxxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxxxxxxx" };
Console.WriteLine("---- Backtracking: sim; Entrada válida: sim ----");
TestaRegex(padraoMau, arrayBom);
Console.WriteLine("------------------------------------------------");
Console.WriteLine("---- Backtracking: sim; Entrada válida: não ----");
TestaRegex(padraoMau, arrayMau);
Console.WriteLine("------------------------------------------------");
Console.WriteLine("---- Backtracking: não; Entrada válida: sim ----");
TestaRegex(padraoBom, arrayMau);
Console.WriteLine("------------------------------------------------");
Console.WriteLine("---- Backtracking: não; Entrada válida: não ----");
TestaRegex(padraoBom, arrayMau);
Console.WriteLine("------------------------------------------------");
Console.ReadLine();
}
private static void TestaRegex(string padrao, string[] array)
{
Console.WriteLine("Tamanho da entrada | Tempo gasto");
Console.WriteLine("------------------------------------");
Stopwatch sw = new Stopwatch();
for (int i = 0; i < array.Length; i++)
{
sw.Start();
Regex.IsMatch(array[i], padrao);
sw.Stop();
Console.WriteLine(" " + string.Format("{0,2}", array[i].Length) + "| " + sw.Elapsed.ToString("ss\\.ffffff"));
sw.Reset();
}
}
}
}
Exit:
---- Backtracking: sim; Entrada válida: sim ----
Tamanho da entrada | Tempo gasto
------------------------------------
3| 00.000347
7| 00.000021
21| 00.000003
23| 00.000002
27| 00.000002
------------------------------------------------
---- Backtracking: sim; Entrada válida: não ----
Tamanho da entrada | Tempo gasto
------------------------------------
3| 00.000011
7| 00.000039
21| 00.594714
23| 02.359978
27| 38.033123
------------------------------------------------
---- Backtracking: não; Entrada válida: sim ----
Tamanho da entrada | Tempo gasto
------------------------------------
3| 00.000043
7| 00.000005
21| 00.000014
23| 00.000017
27| 00.000024
------------------------------------------------
---- Backtracking: não; Entrada válida: não ----
Tamanho da entrada | Tempo gasto
------------------------------------
3| 00.000008
7| 00.000007
21| 00.000019
23| 00.000020
27| 00.000024
------------------------------------------------
The only case of suffering performance was that of invalid entries in a backtracking regex (38 seconds with backtracking versus less than a thousandth for the version without backtracking, for a size 27 input).
The problem is that the number of ways to try to recognize an invalid string in the version without backtracking is associated with number of partitions of an integer number , that is exponential in input size, which causes poor performance.
Solution: When defining a regular expression, make sure that there is a form (or as few forms as possible) only to recognize both valid and invalid entries. Prefer simple expressions such as xx+y instead of (x+x+)y .
In the case of the examples you provided for comparison, they will possess similar performance (good).
Related http://answall.com/questions/81426/regex-tempo
– Guilherme Lautert
@Wallacemaxters, maybe this content can be useful http://answall.com/questions/5136/performance-de-replaces%C3%A7%C3%A3o-em-string/5239#5239, here he compares Regexp to other forms.
– Rafael Kendrik
To test your expressions you can use http://jsperf.com/
– Rafael Kendrik
I used a lot of the references on this site. http://www.regular-expressions.info and for the question I believe that here you can give a good idea of how much Regex can weigh to evaluate a simple string http://www.regular-expressions.info/catastrophic.html
– Marcos Regis
Related: http://answall.com/q/138705/101
– Maniero