Is creating regular expressions from a dynamic pattern problematic? If so, is there a way to avoid the problem?

Question

Asked 4 years ago

Let’s say, for some reason, I need to create a regular expression in which part of the pattern is configurable by a user.

Something like this very trivial example:

const regex = new RegExp('^' + possiblyUnsafeUserInput, 'i');

However, knowing a little about regular expressions, I know this can be problematic, since the user can use the special characters and sequences that regular expressions support, such as (, ), ?, \s, \d, among many others.

So I’d like to ask:

How problematic is this? What problems can it bring me?
Is there any way to solve these problems? If so, how?

^{The question is focused on the JavaScript ecosystem, but in principle it applies to any language.}

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-06-19T12:41:18+00:00

One way to "solve" this is simply to escape the metacharacters by putting a \ before each of them. Something like this:

possiblyUnsafeUserInput = possiblyUnsafeUserInput.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
const regex = new RegExp('^' + possiblyUnsafeUserInput, 'i');

^{The code above is based on this answer.}

Brackets define a character class, and inside one most characters do not need escaping. The exceptions here are the slash (the delimiter of a regex literal), the closing bracket, and the backslash itself, each of which needs a \ before it.

As described in the documentation, $& stands for the whole substring matched by the regex (which here is always a single metacharacter; thanks to the g flag, the substitution is applied to all of them, putting a \ before each one). The metacharacters are thus escaped and interpreted as "normal characters".
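To make the effect of the escaping visible, here is a minimal sketch. The helper name escapeRegExp and the sample input are my own, chosen only for illustration; the replace call is the one shown above.

```javascript
// Hypothetical helper wrapping the replace call shown above.
function escapeRegExp(text) {
  // '\\$&' re-inserts each matched metacharacter with a backslash in front of it.
  return text.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

const userInput = 'a.c';

// Unescaped: the dot is a wildcard, so "abc" matches as well.
const unsafeRegex = new RegExp('^' + userInput, 'i');
// Escaped: the dot only matches a literal dot.
const safeRegex = new RegExp('^' + escapeRegExp(userInput), 'i');

console.log(unsafeRegex.test('abc'));   // true
console.log(safeRegex.test('abc'));     // false
console.log(safeRegex.test('A.C def')); // true
```

One caveat: combined with the u flag, \- is not accepted outside a character class, so this exact character list assumes a regex created without that flag.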

But will you need it?

In your specific example, if the user typed something like "(abc)" and you want to check if the string starts with "(abc)", you could just do:

let comecaCom = algumaString.startsWith(possiblyUnsafeUserInput);

Similarly, if you want to see if the string is in the middle or at the end, you could use algumaString.includes(possiblyUnsafeUserInput) or algumaString.endsWith(possiblyUnsafeUserInput). If the idea is just to check if the user-typed string is contained in another string, you do not need regex.
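A short sketch of the regex-free route. The sample strings are made up, and lowercasing both sides is my own assumption for approximating the i flag of the original regex, since these string methods are case-sensitive.

```javascript
const algumaString = '(ABC) and more text';
const possiblyUnsafeUserInput = '(abc)';

// String methods treat every character literally, so no escaping is needed.
console.log(algumaString.startsWith(possiblyUnsafeUserInput)); // false (case-sensitive)

// Assumption: lowercasing both sides is an acceptable stand-in for the 'i' flag.
const haystack = algumaString.toLowerCase();
const needle = possiblyUnsafeUserInput.toLowerCase();

console.log(haystack.startsWith(needle)); // true
console.log(haystack.includes(needle));   // true
console.log(haystack.endsWith(needle));   // false
```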

Of course, if the typed string is part of something more complex (say I want to check whether this string is followed by some other pattern that is better expressed by a regex, such as one or more numbers, followed by spaces and something else, etc.), then it would make more sense to use a regex. In that case, it would be enough to escape possiblyUnsafeUserInput and concatenate it into the rest of the expression. Ex:

possiblyUnsafeUserInput = possiblyUnsafeUserInput.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
// the string must be at the start, followed by one or more digits
const regex = new RegExp('^' + possiblyUnsafeUserInput + '\\d+', 'i');
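The same idea wrapped in a function, as a sketch only; the name buildPrefixThenDigits and the test strings are hypothetical.

```javascript
// Hypothetical helper: escaped user text at the start, followed by one or more digits.
function buildPrefixThenDigits(userInput) {
  const escaped = userInput.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
  return new RegExp('^' + escaped + '\\d+', 'i');
}

// The "(" typed by the user is matched literally instead of opening a group.
const prefixRegex = buildPrefixThenDigits('item(');

console.log(prefixRegex.test('Item(42)')); // true
console.log(prefixRegex.test('item(x)'));  // false
```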

One of the problems, for me, is the same as with regex in general: the resulting code is not always easy to understand, and it is very easy to get lost among all the \ in the string. But there is also another one...

ReDoS and the dangers of creating regexes from user input

There is a type of attack called ReDoS (Regular expression Denial of Service), which basically consists of submitting a regex that causes so much backtracking that it can "break" the application (it can take too long to run, consuming resources and locking up the machine, for example).

For example, something like ^((ab)*)+$ (where (ab)* is "zero or more occurrences of ab", and all of this can be repeated one or more times). If the string is something like "ababab a", the engine tries various possibilities until it realizes that there is no match. See here that it needs about 70 steps to find this out; if we double the size of the string, the number of steps rises to more than 500 (and adding just one more "ab" raises it to more than 1000).

^{The exact number of steps varies with the implementation of each language/engine, but either way the growth is exponential.}

This happens because nested quantifiers create many different possibilities (it can be one "ab" repeated several times, or "abab" repeated several times, or one "ab" followed by several "ababab", etc.), and the regex tests them all before concluding that there is no match (even though checking all of them seems redundant to us, this is how modern engines usually work). This is all explained in more detail here.
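JavaScript does not expose a step counter, but elapsed time is a rough stand-in. The sketch below assumes a backtracking engine and uses a hypothetical timeMatch helper; the timings depend entirely on the engine and machine, so no values are shown. The sizes are deliberately small, and I would not raise them much.

```javascript
// Nested quantifier from the text: one or more repetitions of "zero or more ab".
const nested = /^((ab)*)+$/;

// Hypothetical helper: runs one match and reports the result and elapsed milliseconds.
function timeMatch(regex, input) {
  const start = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - start };
}

// The trailing " a" forces a failure, so the engine has to exhaust the alternatives first.
for (const pairs of [8, 12, 16, 18]) {
  const input = 'ab'.repeat(pairs) + ' a';
  const { matched, ms } = timeMatch(nested, input);
  console.log(`${input.length} chars: matched=${matched}, ${ms} ms`);
}
```

A string that does match, such as 'ab'.repeat(18), returns right away; it is the failing case that triggers the backtracking.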

Of course, the "ab" example is kind of "silly" and only serves to illustrate the problem, but if instead of "a" and "b" the regex used an expression that matches several different characters (such as \w or even the .), it would have to check several repetitions of several different characters. The possibilities increase exponentially, generating a "catastrophe", and the string does not even have to be that big to "break" the engine.

For example, the regex ^((..)*)+$ searches for one or more occurrences of (..)* (which in turn is zero or more occurrences of any 2 characters). On regex101.com, a string of only 35 characters was already enough to "break" the engine, see here. With 29 characters, the engine needed more than 160,000 steps to realize that there is no match (see), because one repetition nested inside another creates numerous possibilities to search: one occurrence of .. repeated several times, or one occurrence of .. followed by 2 occurrences of .., followed by 3 occurrences, and so on. In the end there are so many possibilities that the engine "breaks", and even if it does not, the execution time can be high enough to impact the application.

Of course, the exact amount of steps and the size of the input causing the problem varies from one language/engine/API to another, since it depends on implementation details, on the regex used, on the strings being tested, whether the API does some internal optimization depending on the case, etc. But in general, they are susceptible to this type of attack.

So it is important to validate the input before creating a regex from an arbitrary string. In this specific case, I believe that escaping the metacharacters already prevents many cases like the one above, but I have not researched it enough, and I believe there may even be some "smart" regex that can get around this solution.
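As a sketch of what such validation could look like (not a complete defense): the function name buildPrefixRegex and the length limit are my own assumptions, and the escaping is the same replace call used earlier.

```javascript
// Hypothetical limit; pick a value that suits the application.
const MAX_INPUT_LENGTH = 100;

function buildPrefixRegex(userInput) {
  if (typeof userInput !== 'string' || userInput.length === 0) {
    throw new TypeError('expected a non-empty string');
  }
  if (userInput.length > MAX_INPUT_LENGTH) {
    throw new RangeError('input is too long');
  }
  // After escaping, quantifiers and groups in the input are plain characters.
  const escaped = userInput.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
  return new RegExp('^' + escaped, 'i');
}

// The nested-quantifier pattern from the text becomes a harmless literal prefix.
const literalRegex = buildPrefixRegex('((..)*)+$');

console.log(literalRegex.test('((..)*)+$ is just text now')); // true
console.log(literalRegex.test('a'.repeat(35)));               // false
```

Because the quantifiers are escaped, the 35-character string is rejected at the first character instead of being explored exponentially.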

There are even some libs that "promise" to protect you against these malicious regexes (such as that one and that one, which I have not tested).

Finally, this is not as serious as eval (which was mentioned in the comments), since the RegExp constructor only checks whether the string is a valid expression (there is no arbitrary code execution). Even so, it is always good to validate and sanitize the inputs.
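To illustrate that an invalid expression only produces an exception, here is a small sketch with a hypothetical tryCompile wrapper that turns the SyntaxError thrown by the constructor into a result object.

```javascript
// Hypothetical wrapper: reports whether a pattern compiles instead of letting it throw.
function tryCompile(pattern, flags) {
  try {
    return { ok: true, regex: new RegExp(pattern, flags) };
  } catch (err) {
    if (err instanceof SyntaxError) {
      return { ok: false, error: err.message };
    }
    throw err; // anything else is unexpected, so let it propagate
  }
}

console.log(tryCompile('^(abc', 'i').ok);   // false: unterminated group
console.log(tryCompile('^\\(abc', 'i').ok); // true: the parenthesis is escaped
```

Note that compiling successfully says nothing about how expensive the regex is to run, so this check does not replace the escaping or the validation discussed above.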

And as I said, in your particular case you may not even need regex.

Other languages have dedicated methods that already do the escaping correctly, such as Python and Java. Other engines support the \Q and \E shortcuts, which escape everything between them (i.e., \Q[]\E is the same as \[\]). Unfortunately, at the time of writing, JavaScript supports neither these shortcuts nor a built-in method that does the escaping.
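Since this answer was written, a RegExp.escape static method has been standardized (ECMAScript 2025), but older runtimes lack it. The sketch below therefore feature-detects it and falls back to the manual replace; the function name escapeForRegExp is hypothetical, and availability in any given environment is an assumption to verify.

```javascript
// Hypothetical helper: prefer the built-in when present, otherwise escape manually.
function escapeForRegExp(text) {
  // Assumption: RegExp.escape exists only on newer runtimes (ES2025).
  if (typeof RegExp.escape === 'function') {
    return RegExp.escape(text);
  }
  return text.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Rough equivalent of \Q[]\E: the brackets are matched literally.
const literalBrackets = new RegExp('^' + escapeForRegExp('[]') + '$');

console.log(literalBrackets.test('[]')); // true
console.log(literalBrackets.test(''));   // false
```

The two branches may produce different escaped strings for the same input, so compare the behavior of the resulting regex rather than the escaped text itself.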