I have this regex:
Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
Basically, it has the following parts:
- one or more lowercase vowels ([aeiou]+), followed by one or more digits ([0-9]+), or
- one or more of the digits 1, 2 or 3 ([123]+), followed by one or more lowercase letters ([a-z]+)
- all of this followed by one or more non-word characters (\W+), that is, anything other than letters, digits and underscore
I also have two capture groups: one for vowels and one for digits 1, 2 or 3.
Since I'm using alternation (|), only one of these groups will be captured. For example:
Matcher m = p.matcher("ae123.");
if (m.find()) {
    int n = m.groupCount();
    // print every group; groups that did not take part in the match are null
    for (int i = 1; i <= n; i++) {
        System.out.format("group %d: %s\n", i, m.group(i));
    }
}
In that case, only the first group is captured, and the output is:
group 1: ae
group 2: null
But if the string is "111abc!!", the second group is captured, and the output is:
group 1: null
group 2: 111
That is, to know which group was captured, I have to iterate over them until I find one that is not null.
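The loop described above can be wrapped in a small helper. This is only a minimal sketch of that workaround, not a real branch reset: the class name FirstCapturedGroup and the methods firstCaptured and extract are hypothetical names introduced for illustration, and the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstCapturedGroup {
    // Same pattern as in the question.
    private static final Pattern P =
        Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

    // Hypothetical helper: returns the first group that took part in the
    // match, or null if none did. Must be called after a successful find().
    static String firstCaptured(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            String g = m.group(i);
            if (g != null) {
                return g;
            }
        }
        return null;
    }

    // Hypothetical convenience method: null when the input does not match.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? firstCaptured(m) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("ae123."));   // prints "ae"
        System.out.println(extract("111abc!!")); // prints "111"
    }
}
```

This keeps the null test in one place, but it still visits every group, so it hides the loop rather than removing it.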
Some regex engines support branch reset: starting a group with (?| causes the group numbering to be "reset" each time an alternation (|) is found. In those engines it would suffice to change the regex to:
Pattern p = Pattern.compile("(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
The branch reset ((?|) makes both ([aeiou]+) and ([123]+) be group 1 (since this is an alternation, either one matches or the other does, so only one of these expressions is ever captured). That way I wouldn't need to test whether the groups are null; I could read group 1 directly (m.group(1) would hold the value I want, with no need for a for loop over all groups testing for null).
But Java does not support branch reset, and the above code throws an exception:
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 2
(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\W+
  ^
I'm using Java 8, but from what I saw in the Java 14 documentation, this feature is still not supported by the regex API (and the Java 15 preview does not mention it either).
I also saw this solution for .NET, which consists of using named groups and giving every group the same name, but that does not work in Java either:
Pattern p = Pattern.compile("(?:(?<grupo>[aeiou]+)[0-9]+|(?<grupo>[123]+)[a-z]+)\\W+");
This code throws an exception, because Java does not allow two groups with the same name:
java.util.regex.PatternSyntaxException: Named capturing group <grupo> is already defined near index 36
(?:(?<grupo>[aeiou]+)[0-9]+|(?<grupo>[123]+)[a-z]+)\W+
                                    ^
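For comparison, the pattern does compile when each group gets its own name. The sketch below is not an emulation of branch reset, since it still needs a null test; it only shows what the named-group variant looks like under Java's one-name-per-group rule. The group names vowels and digits and the class name DistinctNamedGroups are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DistinctNamedGroups {
    // Same structure as the question's pattern, but with a distinct
    // (hypothetical) name for each capture group.
    private static final Pattern P = Pattern.compile(
        "(?:(?<vowels>[aeiou]+)[0-9]+|(?<digits>[123]+)[a-z]+)\\W+");

    // Returns whichever named group took part in the match,
    // or null when the input does not match at all.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        String vowels = m.group("vowels");
        return vowels != null ? vowels : m.group("digits");
    }

    public static void main(String[] args) {
        System.out.println(extract("ae123."));   // prints "ae"
        System.out.println(extract("111abc!!")); // prints "111"
    }
}
```

With only two alternatives the null test collapses into a single conditional expression, but it grows with every branch added to the alternation.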
Is there any way to emulate branch reset in Java, or is the only solution to loop over the groups, testing whether each one is null?
I believe we cannot simulate branch reset in a clean way in Java.
– Victor Stafusa
@Victorstafusa Yeah, I was trying some juggling with regex, and I'm starting to think that either it's not possible, or it is, but it's not worth the complication. But let's wait and see if someone comes up with an answer :-)
– hkotsubo