Different regex strategies to get the same result

I have the following input:

Detalhamento de Serviços nº: 999-99999-9999

I need to capture the number in a group. For that I would use:

Detalhamento de Serviços nº: (\d+-\d+-\d+)

However, I can't rely on the string nº: being present (note: I mean this literal string, not the number itself). So I would have two options:

 1. Detalhamento de Serviços.+(\d+-\d+-\d+) 
 2. Detalhamento de Serviços[\D]+(\d+-\d+-\d+)

Both regexes would return the same result. My question is:

What is the difference between using the class "any Character" and "non-digit" in this case? What is the best practice and why? Which has the highest performance and why?

  • Taking into account only the name, "any character" is literally any character, whereas "non-digit" is any character except digits, is it not? So they're not the same.

  • Yes, but for the input I gave as an example, both work and return the desired result. The question is: which is the right one to use, and why?

1 answer

In fact, the two regexes you indicated do not return the same result. I ran a test on JDK 1.7.0_80, and it is also possible to see them working (differently) here and here.

I created a very simple method to test a regex:

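import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prints the contents of capture group 1 if the regex is found in the input
public void testregex(String input, String regex) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}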

Then I tested the same input with the two regexes (note that inside a Java string literal the \ must be escaped, so it is written as \\):

String input = "Detalhamento de Serviços nº: 999-99999-9999";
testregex(input, "Detalhamento de Serviços.+(\\d+-\\d+-\\d+)");
testregex(input, "Detalhamento de Serviços\\D+(\\d+-\\d+-\\d+)");

The result was:

9-99999-9999
999-99999-9999

This is because the quantifiers + and * are "greedy": they try to consume as many characters as possible. In the first case, .+ first swallows everything up to the end of the input and then backtracks one character at a time, stopping as soon as the remaining text satisfies the last part of the regex (\d+-\d+-\d+). That happens when the remainder is 9-99999-9999, so .+ ends up also consuming the first two 9 digits.

In the second case, it doesn't consume the first two 9s because \D can never match a digit.
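To make the difference visible, here is a minimal sketch that wraps the quantified part in an extra capture group, so you can see exactly what .+ and \D+ consumed. The class name GreedyDemo and the method split are only illustrative names, and the extra group is an addition for demonstration, not part of the original regexes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns "consumed|number": group 1 is what the quantified part
    // consumed, group 2 is the captured number. Returns null if no match.
    static String split(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) + "|" + m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "Detalhamento de Serviços nº: 999-99999-9999";
        // Greedy dot: backtracks only as far as needed, so it keeps two 9s
        System.out.println(split(input, "Detalhamento de Serviços(.+)(\\d+-\\d+-\\d+)"));
        // Non-digit class: must stop before the first digit
        System.out.println(split(input, "Detalhamento de Serviços(\\D+)(\\d+-\\d+-\\d+)"));
    }
}
```

With the greedy dot, group 1 ends in 99 and the number loses its first two digits; with \D+, group 1 stops right before the first digit.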

Therefore, some possible solutions are:

  • Use \D: this guarantees that, however greedy the quantifier is, it won't consume a digit by mistake
  • Put a ? right after the quantifier +, which turns off the "greedy" behavior (the quantifier becomes "lazy"). The regex looks like this: Detalhamento de Serviços.+?(\d+-\d+-\d+) - note the use of .+? to remove the "greed"
  • Fix the number of digits using {}. For example, if the number of digits is always "3-5-4", you can use Detalhamento de Serviços.+?(\d{3}-\d{5}-\d{4}). If the number of digits varies, use the syntax {min,max}. For example, if there is a minimum of 2 digits and a maximum of 3, use {2,3} (combined with the lazy quantifier or \D to be safe). Adapt according to your need.

  • Oops, that's just perfect. I had not paid attention to what matcher.group returned; I had only used matcher.find to check whether there was a match.
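The three alternatives above can be compared side by side. The sketch below is only illustrative: the class name NumberExtractor, the helper firstGroup, and the second input (a guess at what the line might look like when nº: is absent) are assumptions, not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberExtractor {
    static final String PREFIX = "Detalhamento de Serviços";

    // Returns capture group 1 of the first match, or null if there is none
    static String firstGroup(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withLabel = "Detalhamento de Serviços nº: 999-99999-9999";
        // Assumed shape of the input when "nº:" is missing
        String withoutLabel = "Detalhamento de Serviços 999-99999-9999";

        String[] patterns = {
            PREFIX + "\\D+(\\d+-\\d+-\\d+)",          // non-digit class
            PREFIX + ".+?(\\d+-\\d+-\\d+)",           // lazy quantifier
            PREFIX + ".+?(\\d{3}-\\d{5}-\\d{4})"      // fixed digit counts
        };
        for (String p : patterns) {
            System.out.println(firstGroup(withLabel, p) + " / " + firstGroup(withoutLabel, p));
        }
    }
}
```

On these two inputs, all three patterns should capture the full number. They can still differ on other inputs: for instance, \D+ refuses to skip over any digit that appears before the number, while .+? will skip over anything.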
