How to emulate regex branch reset in Java

15

I have this regex:

Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

Basically, it has the following parts:

  • one or more lowercase vowels ([aeiou]+), followed by one or more digits ([0-9]+), or
  • one or more of the digits 1, 2 or 3 ([123]+), followed by one or more lowercase letters ([a-z]+)
  • all this followed by one or more non-word characters (\W+)

I also have two capture groups: one for the vowels and one for the digits 1, 2 or 3. Since I'm using alternation (|), only one of these groups will be captured. For example:

Matcher m = p.matcher("ae123.");
if (m.find()) {
    int n = m.groupCount();
    for (int i = 1; i <= n; i++) {
        System.out.format("group %d: %s\n", i, m.group(i));
    }
}

In that case, only the first group is captured, and the output is:

group 1: ae
group 2: null

But if the string is "111abc!!", the second group is captured, and the output is:

group 1: null
group 2: 111

That is, to know which group was captured, I have to go through them until I find one that is not null.


Some regex engines support branch reset: starting a group with (?| causes the group numbering to be "reset" at each alternation (|). So it would be enough to change the regex to:

Pattern p = Pattern.compile("(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

The branch reset ((?|) makes both ([aeiou]+) and ([123]+) group 1 (since this is an alternation, it's one or the other, so only one of these expressions is captured). So I wouldn't need to test whether the groups are null; I could take group 1 directly (m.group(1) would have the value I want, without looping over all the groups and testing each one for null).

But Java does not support branch reset, and the above code throws an exception:

java.util.regex.PatternSyntaxException: Unknown inline modifier near index 2
(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\W+
  ^

I'm using Java 8, but from what I saw in the Java 14 documentation, this feature is still not supported by the regex API (and the Java 15 preview makes no mention of it either).

I also saw this solution for .NET, which consists of using named groups and giving all of them the same name, but that doesn't work in Java either:

Pattern p = Pattern.compile("(?:(?<grupo>[aeiou]+)[0-9]+|(?<grupo>[123]+)[a-z]+)\\W+");

This code throws an exception, because Java does not allow groups with the same name:

java.util.regex.PatternSyntaxException: Named capturing group <grupo> is already defined near index 36
(?:(?<grupo>[aeiou]+)[0-9]+|(?<grupo>[123]+)[a-z]+)\W+
                                    ^

Is there any way to emulate branch reset in Java, or is the only solution to loop over the groups, testing whether each one is null?

2 answers

7

I found an alternative that is not very "elegant" (and has limitations, explained below), using replaceAll:

String regex = "(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+";
System.out.println("ae123.".replaceAll(regex, "$1$2"));
System.out.println("111abc!!".replaceAll(regex, "$1$2"));

This prints:

ae
111

The trick is in the second parameter. "$1$2" means I'm concatenating group 1 ($1) with group 2 ($2). Since the regex has an alternation (|), only one of the groups is captured and the other is replaced by an empty string, so the concatenation is always the value of the group that was captured.


But as I said at the beginning, this approach has some limitations. Suppose the regex is a bit more complicated, with several groups:

(1) | (2) (3) (4) | (5) (6) | (7) | (8)

In this case, I may have only group 1 captured, or only groups 2, 3 and 4, or only 5 and 6, or only 7, or only 8. I could still use replaceAll with "$1$2$3$4$5$6$7$8", but when groups 2, 3 and 4 are captured they would be concatenated, and I would not be able to get the value of each one separately. Unless I use some separator, like "$1,$2,$3,$4,$5,$6,$7,$8", and then do a split, but at that point it starts to become a hack too.

With branch reset, the numbering of the groups would be:

(?| (1) | (1) (2) (3) | (1) (2) | (1) | (1) )

Then a simple loop over the groups would be enough (always starting at 1 and going up to m.groupCount()).
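To get something close to that numbering without branch reset, one option is to collect only the groups that are not null, in order. This is just a sketch: the helper name participatingGroups and the sample regex (three branches with 1, 2 and 1 groups) are made up for illustration, and it assumes that every group inside the matching branch actually participates (no optional groups such as (x)?).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParticipatingGroups {

    // Hypothetical helper: returns the non-null groups of the current match, in order.
    // Index 0 of the list plays the role of "group 1" under branch reset, and so on.
    static List<String> participatingGroups(Matcher m) {
        List<String> groups = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String g = m.group(i);
            if (g != null) {
                groups.add(g);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample regex (an assumption, not from the question): group 1 in the
        // first branch, groups 2-3 in the second, group 4 in the third.
        Pattern p = Pattern.compile("(?:([aeiou]+)|([123]+)([a-z]+)|(#+))\\W+");
        for (String input : new String[] {"ae.", "111abc!!", "##!"}) {
            Matcher m = p.matcher(input);
            if (m.find()) {
                System.out.println(input + " -> " + participatingGroups(m));
            }
        }
    }
}
```

This prints "ae. -> [ae]", "111abc!! -> [111, abc]" and "##! -> [##]". It is still a loop testing for null, only hidden in one place.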

In any case, I'm still waiting for other solutions.

  • 1

    I believe that we cannot simulate branch reset in a clean way in Java.

  • @Victorstafusa Yeah, I was trying some juggling with regex, and I'm starting to think that either it's not possible, or it is, but it's not worth the complication. But let's wait, maybe someone will come up with an answer :-)

1

As of this writing, the current version of Java is 16, and branch reset is still not available. So another alternative, still far from ideal, is to use lookarounds:

Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

The code above puts both the ae and the 111 in group 1, simulating the behavior of branch reset (so I don't need to test whether the match is in group 1 or 2).

Basically, I use an alternation (the |, which means "or") with two options. The first looks for vowels, with a lookahead that checks whether they are followed by \\d+\\W+ (digits and \W+). But since this last part is inside the lookahead, that is, within (?= ), it won't be part of the match, because lookarounds are zero-length assertions: they only check whether something exists (hence "assertion"), but its content is not returned in the match (hence "zero length").
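To make the "zero length" part more concrete, here is a small sketch comparing the same expression with and without the lookahead. The class and the firstMatch helper are just names I made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {

    // Hypothetical helper: text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Without lookahead, the digits and the dot are consumed and become part of the match.
        System.out.println(firstMatch("[aeiou]+\\d+\\W+", "ae123."));
        // With lookahead, they are only checked; the match ends right after the vowels.
        System.out.println(firstMatch("[aeiou]+(?=\\d+\\W+)", "ae123."));
        // The lookahead still has to succeed: no digits after the vowels, so there is no match.
        System.out.println(firstMatch("[aeiou]+(?=\\d+\\W+)", "ae."));
    }
}
```

This prints ae123., then ae, then null.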

The second option looks for the digits 1, 2 or 3, and what comes next (the letters and the \W+) goes in another lookahead.

All of this is in parentheses, forming a single capture group. So either the vowels or the digits 1, 2 or 3 (but not what comes after them) will be in this group. Hence I only need to check group 1 of the Matcher.


This can solve the simplest cases, but what if I wanted two groups? For example, if the digits that come after the vowels and the letters that come after 1/2/3 also had to be in a group (i.e., in group 2). With branch reset, it would be enough to do:

(?|([aeiou]+)([0-9]+)|([123]+)([a-z]+))\W+

But using lookarounds, I would have to do something similar, putting another alternation in group 2:

Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))(\\d+|[a-z]+)(?=\\W+)");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}

In this case, group 2 is a little simpler, as it has only the digits or the letters. The problem is redundancy: I have to repeat the digits and letters in the lookaheads of group 1, and again in group 2. That's because a lookahead only looks at what is ahead and then goes back to where it was (in this case, to the point immediately after group 1). For the characters of group 2 to be consumed and be part of the match, they have to appear in the expression again.

That is, with more groups it would be even more redundant, with parts of the expression repeated several times, to the point of becoming impractical. Not to mention that this still doesn't handle well the cases where each alternative has a different number of groups (the regex would be even harder to understand and maintain).


In other words, there is still no good solution that fits all cases. Perhaps the way to go is what the question suggests: iterate through the groups, checking which one is not null.
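If it comes down to the loop, it can at least be isolated in a small helper. A minimal sketch, using the regex from the question; the name firstCapturedGroup is my own, and it assumes that at most one of the groups matters per match, as in that regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstGroup {

    // Hypothetical helper: first group that participated in the current match, if any.
    static Optional<String> firstCapturedGroup(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            String g = m.group(i);
            if (g != null) {
                return Optional.of(g);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Regex from the question: group 1 holds the vowels, group 2 the digits 1, 2 or 3.
        Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
        Matcher m = p.matcher("ae123. 111abc!!");
        while (m.find()) {
            System.out.println(firstCapturedGroup(m).orElse("<none>"));
        }
    }
}
```

This prints ae and then 111, and the calling code no longer needs to know which branch matched.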
