Regular Expression - Split disregarding anything within parentheses

Can someone help me with a Java regular expression that splits text on . (dot), ignoring any dots inside parentheses? For example:

abacaxi.laranja.(pera.banana)limao.mamao 

It should produce this result:

abacaxi
laranja
(pera.banana)limao
mamao

1 answer

The problem is detecting that a dot is inside parentheses: the regex would have to check that each opening parenthesis has a matching closing one, that they are properly balanced, and so on (not to mention parentheses nested inside others). For that you would need a recursive regex, which Java does not support. And even if it did, it would not be the simplest way to solve this problem (to get an idea of what a recursive regex looks like, see examples here and here).

I think it is simpler to walk through the string keeping a count of open parentheses, and record the positions of the dots, ignoring those inside parentheses.

Then I use these positions to extract pieces of the string with the substring method:

import java.util.ArrayList;
import java.util.List;

String s = "abacaxi.laranja.(pera.banana)limao.mamao";
int open = 0; // count of currently open parentheses
List<Integer> posicoes = new ArrayList<>();
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (c == '(') {
        open++;
    } else if (c == ')') {
        open--;
    } else if (c == '.' && open == 0) {
        // found a dot that is not inside parentheses
        posicoes.add(i);
    }
}

List<String> partes = new ArrayList<>();
int posInicial = 0;
for (int pos : posicoes) {
    // use the positions of the dots to extract the substrings
    partes.add(s.substring(posInicial, pos));
    posInicial = pos + 1;
}
// don't forget to add the last piece (from the last dot to the end of the string)
partes.add(s.substring(posInicial));
System.out.println(partes);

The result is a list with 4 elements (when a List is printed, its elements are shown separated by commas):

[abacaxi, laranja, (pera.banana)limao, mamao]

This code also works for nested parentheses (for example, if the string is abacaxi.laranja.(pera.banana(abc.def.ghi))limao.mamao, the whole segment (pera.banana(abc.def.ghi))limao is treated as a single part).


The code above assumes that the parentheses are always balanced (for every ( there is a corresponding )). Since it is not clear what should happen with unbalanced parentheses (a ( or ) missing or left over), I will leave it like this for now.
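
If unbalanced parentheses should be treated as an error (that is only an assumption, since the question does not say), one possible sketch is to wrap the same counting logic in a method that rejects such input. The class name TopLevelSplitter, the method name splitTopLevel, and the choice of IllegalArgumentException are hypothetical, picked just for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    // Hypothetical helper: splits on dots at parenthesis depth zero.
    // Rejecting unbalanced input is an assumption, not a requirement from the question.
    static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // number of currently open parentheses
        int start = 0; // index where the current part begins
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth < 0) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
            } else if (c == '.' && depth == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unclosed '(' in input: " + depth + " still open");
        }
        parts.add(s.substring(start)); // last part, after the final top-level dot
        return parts;
    }
}
```

This version collects the parts in a single pass instead of storing the dot positions first. With the nested example string it should return the same four parts as before, and something like "a.(b.c" would throw instead of silently producing a result.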


So regex can't be used at all?

Although Java does not support recursive regex, you can write a limited regex for your specific case (where a pair of parentheses cannot contain another pair). But instead of split, I'm going to use a java.util.regex.Matcher to look for the parts I want (which amounts to the same thing: split and match are two sides of the same coin. What changes is the logic: with split I say what I do not want in the final result, and with match I say what I do want):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Matcher matcher = Pattern.compile("(?:\\([^()]+\\))?[^.()]+(?=\\.|$)").matcher(s);
partes = new ArrayList<>();
while (matcher.find()) {
    partes.add(matcher.group(0));
}
System.out.println(partes);

The result is the same as with the previous code. The part (?:\\([^()]+\\))? looks for the parentheses \\( and \\) containing any characters that are not parentheses (the [^()]+ part), and the ? at the end makes this whole part optional.

Next we have [^.()]+, which is "one or more characters that are neither parentheses nor dots".

And then there is the lookahead (the part with (?=), which contains \\.|$ (a dot or the end of the string). The detail is that the lookahead only checks what comes next; it is not part of the match.

Then I iterate over all the matches and add them to the list. At the end I have the list with the desired parts.
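
To see that the lookahead really does not consume the dot, here is a minimal sketch using a simplified version of the regex (without the optional parenthesized part). The class LookaheadDemo and the method firstMatchAndRest are hypothetical names used only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Simplified pattern: a run of non-dot, non-parenthesis characters
    // that must be followed by a dot or by the end of the string.
    private static final Pattern PART = Pattern.compile("[^.()]+(?=\\.|$)");

    // Returns {matched text, remainder of the input after the match}.
    static String[] firstMatchAndRest(String s) {
        Matcher m = PART.matcher(s);
        if (!m.find()) {
            return new String[] {"", s};
        }
        // m.end() is the index right after the match; the lookahead adds nothing to it
        return new String[] {m.group(), s.substring(m.end())};
    }
}
```

For the input "abacaxi.laranja", the matched text is abacaxi and the remainder is .laranja: the dot was checked by the lookahead but stayed out of the match.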

Keep in mind that this regex assumes there are no nested parentheses and that the pairs are always balanced. For strings with nested parentheses, it is best to use the first code I suggested (though it also assumes the pairs are balanced).

Another detail is that this regex only handles parentheses that come right after a dot. If the string is "abacaxi.laranja(pera.banana).mamao", you can change it to:

Matcher matcher = Pattern.compile("[^.()]*(?:\\([^()]+\\))?[^.()]*(?=\\.|$)").matcher(s);
partes = new ArrayList<>();
while (matcher.find()) {
    String parte = matcher.group(0);
    if (parte.length() > 0)
        partes.add(parte);
}
System.out.println(partes);

Now I use * instead of + (to match zero or more occurrences). Because of that, I need to check whether each part has zero length, since this regex also produces some empty matches. But, again, this regex does not work for nested parentheses (for those cases, prefer the first code, without regex).
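
To make the empty matches visible, here is a small sketch that runs the same regex and lets you choose whether zero-length matches are kept. The class EmptyMatchDemo, the method matches, and the keepEmpty flag are hypothetical, introduced only for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Same regex as in the answer: every piece is optional, so it can match nothing at all.
    private static final Pattern PART =
        Pattern.compile("[^.()]*(?:\\([^()]+\\))?[^.()]*(?=\\.|$)");

    // Collects every match; when keepEmpty is false, zero-length matches are dropped.
    static List<String> matches(String s, boolean keepEmpty) {
        List<String> result = new ArrayList<>();
        Matcher m = PART.matcher(s);
        while (m.find()) {
            String part = m.group();
            if (keepEmpty || !part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }
}
```

Calling matches("abacaxi.laranja(pera.banana).mamao", true) returns a list that also contains empty strings, because the regex can match zero characters right before a dot or at the end of the input. With keepEmpty set to false, only abacaxi, laranja(pera.banana) and mamao remain.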

  • I used Regex, and it worked perfectly. Thank you so much for your help.