Split String into blocks, considering only letters or text in square brackets

Question

Asked 5 years ago

Suppose I have a string like this:

"Olá, o meu nome é [José Cavalo]"

And I call split on it (splitting on spaces), I get:

{"Olá,", "o", "meu", "nome", "é", "[José", "Cavalo]"}

First of all, there is a problem: at index 0 we have "Olá,", and I want to separate "Olá" from the comma. Second, I want to split the String into blocks by detecting square brackets, so that "José" and "Cavalo" are joined into "[José Cavalo]" as a single element (a block, as described in the title).

I thought of using a regex for the block case:

Matcher m = Pattern.compile("\\[([^)]+)\\]").matcher("Olá, o meu nome é [José Cavalo]");
while(m.find()) {
    System.out.println(m.group(1));
}

But this code does not behave as expected. If there are multiple blocks, it matches from the first opening square bracket all the way to the last closing square bracket:

// Consider the following String:
String s = "[bloco1] [bloco2]";
// ... the regex code for the split goes here ...

// Result:
// bloco1] [bloco2

We already have two problems here:

The first is that the match runs from the first opening bracket to the last closing bracket of the whole string. The second is that it removes the square brackets. I don't want that, because I already have a separate function to remove them.

I also have no idea how to separate the comma from "Olá".

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-06-26T12:51:08+00:00

In your regex you used [^)], which means "any character that is not )". The quantifier + is "greedy" by default and tries to match as many characters as possible. So with "[bloco1] [bloco2]" it takes the whole string: you only said it cannot match ), so the regex is free to match [ or ] as many times as it needs, and the greedy behavior tells it to grab as much as it can.
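To see the greedy behavior concretely, here is a minimal sketch that runs the regex from the question next to a variant whose character class also excludes the brackets themselves. The class name GreedyBracketDemo and the helper captures are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBracketDemo {
    // Hypothetical helper: collects group(1) of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "[bloco1] [bloco2]";
        // [^)] only excludes ')', so the greedy + runs across "] [".
        System.out.println(captures("\\[([^)]+)\\]", s));      // [bloco1] [bloco2]
        // Excluding '[' and ']' stops each match at its own closing bracket.
        System.out.println(captures("\\[([^\\[\\]]+)\\]", s)); // [bloco1, bloco2]
    }
}
```

The first line printed is a list with a single element, "bloco1] [bloco2", which is exactly the result reported in the question.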

If you don't want the regex to match the brackets, include them in the negated character class (remembering to escape them with \):

Matcher m = Pattern.compile("\\[([^\\[\\]]+)\\]")
                   .matcher("Olá, o meu nome é [José Cavalo] [bloco 1] [bloco 2]");
while (m.find()) {
    System.out.println(m.group(1));
}

The output is:

José Cavalo
bloco 1
bloco 2

If you want the result to be [José Cavalo], [bloco 1] and [bloco 2] (with brackets), simply change m.group(1) to m.group(0) (or just m.group(), with no arguments; both return the entire match).
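As a small sketch of the difference between the two group numbers, the helper below (GroupDemo and groups are hypothetical names, not part of the original answer) collects either group 0 or group 1 for each match of the same bracket regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Same bracket regex as in the answer; group 1 is the text between the brackets.
    private static final Pattern BLOCK = Pattern.compile("\\[([^\\[\\]]+)\\]");

    // Hypothetical helper: returns the requested group of every match.
    static List<String> groups(String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "[bloco 1] [bloco 2]";
        System.out.println(groups(s, 0)); // [[bloco 1], [bloco 2]]  (whole match, brackets kept)
        System.out.println(groups(s, 1)); // [bloco 1, bloco 2]      (capture group only)
    }
}
```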

You could also use "\\[([^\\[\\]()]+)\\]" (adding ( and ) to the character class), so that the regex matches neither brackets nor parentheses inside a block.
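A quick sketch of what adding the parentheses changes, using an input invented for this illustration: a block that contains parentheses is matched by the bracket-only class but skipped entirely by the stricter one. ParenExclusionDemo and captures are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenExclusionDemo {
    // Hypothetical helper: collects group(1) of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "[a(b)c] [ok]"; // made-up input for the demonstration
        // Only '[' and ']' are excluded: both blocks match.
        System.out.println(captures("\\[([^\\[\\]]+)\\]", s));   // [a(b)c, ok]
        // '(' and ')' are excluded too: the first block can no longer match.
        System.out.println(captures("\\[([^\\[\\]()]+)\\]", s)); // [ok]
    }
}
```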

Now, to match only Olá (without the comma), you can use the shorthand \w:

String s = "Olá, o meu nome é [José Cavalo]";
Matcher matcher = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}

I also used the option Pattern.UNICODE_CHARACTER_CLASS so that \w matches accented characters as well.
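The effect of the flag is easiest to see side by side. In this sketch (WordCharDemo and words are hypothetical names), the accented letters are written as Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCharDemo {
    // Hypothetical helper: finds every \w+ run, with or without the Unicode flag.
    static List<String> words(String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+", flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u00e1" is á and "\u00e9" is é, so this is "Olá, o meu nome é".
        String s = "Ol\u00e1, o meu nome \u00e9";
        // Without the flag, \w is ASCII-only: á cuts "Olá" short and é is not matched at all.
        System.out.println(words(s, false)); // [Ol, o, meu, nome]
        // With the flag, accented letters count as word characters.
        System.out.println(words(s, true));  // [Olá, o, meu, nome, é]
    }
}
```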

And to handle both the word without the comma and the square-bracket case, use alternation (the character |, which means "or"):

String s = "Olá, o meu nome é [José Cavalo] [bloco 1] [bloco 2]";
Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}

This way, at each position the regex first tries to match a block between brackets. If that fails, it tries to match a sequence of \w characters.

The output is:

Olá
o
meu
nome
é
[José Cavalo]
[bloco 1]
[bloco 2]
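If the blocks are needed as a collection rather than printed, the same alternation can be wrapped in a small method. This is only a sketch: BlockTokenizer and tokenize are hypothetical names, and the capturing group is dropped because only the whole match is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockTokenizer {
    // Bracketed block first, then a run of word characters (Unicode-aware).
    private static final Pattern TOKEN =
            Pattern.compile("\\[[^\\[\\]]+\\]|\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: returns every block found in input, in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00e1" is á and "\u00e9" is é: "Olá, o meu nome é [José Cavalo] [bloco 1]".
        String s = "Ol\u00e1, o meu nome \u00e9 [Jos\u00e9 Cavalo] [bloco 1]";
        System.out.println(tokenize(s)); // [Olá, o, meu, nome, é, [José Cavalo], [bloco 1]]
        // An unclosed bracket is not a block, so its words fall back to \w+.
        System.out.println(tokenize("[abc def")); // [abc, def]
    }
}
```

Note that with this pattern an opening bracket that is never closed is silently skipped and its contents are returned as ordinary words; whether that is the desired behavior depends on the input you expect.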

One detail is that \w also matches digits and the character _. If you want to limit it to just letters, another option is to use Unicode properties:

Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\p{L}+").matcher(s);

In this case, \p{L} matches every letter defined by Unicode (all the general categories whose names start with "L"). But this includes other writing systems, such as Japanese, Arabic, Cyrillic, etc. If you want to restrict the match to the Latin alphabet only, another option is to use \p{Script=Latin}:

Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\p{Script=Latin}+").matcher(s);

In both cases, you do not need to use the option Pattern.UNICODE_CHARACTER_CLASS.
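To compare the three options, the sketch below runs each pattern over a made-up input that mixes Latin letters, an underscore, a digit and a Cyrillic word. LetterClassDemo and findAll are hypothetical names; the Cyrillic word is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterClassDemo {
    // Hypothetical helper: returns every match of regex (compiled with flags) in input.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The escapes spell the Cyrillic word "привет".
        String s = "abc_1 \u043f\u0440\u0438\u0432\u0435\u0442";
        // \w with the Unicode flag: letters of any script, plus digits and '_'.
        System.out.println(findAll("\\w+", Pattern.UNICODE_CHARACTER_CLASS, s)); // [abc_1, привет]
        // \p{L}: letters of any script, but no digits or '_'.
        System.out.println(findAll("\\p{L}+", 0, s));                            // [abc, привет]
        // \p{Script=Latin}: only characters belonging to the Latin script.
        System.out.println(findAll("\\p{Script=Latin}+", 0, s));                 // [abc]
    }
}
```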