In your regex you used [^)], which means "any character that is not )". The + quantifier is "greedy" by default and tries to match as many characters as possible. So if the string contains "[bloco1] [bloco2]", it matches the whole thing: since the only character you excluded was ), the regex is free to consume extra [ and ] characters, and the greedy behavior makes it do so.
If you don't want it to match the brackets, include them in the negated character class (remembering to escape them with \):
Matcher m = Pattern.compile("\\[([^\\[\\]]+)\\]")
.matcher("Olá, o meu nome é [José Cavalo] [bloco 1] [bloco 2]");
while (m.find()) {
System.out.println(m.group(1));
}
The output is:
José Cavalo
bloco 1
bloco 2
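To see the greedy behavior side by side with the fix, here is a minimal sketch. The first pattern is only an assumption about the shape of the original regex (a class that excludes just the closing parenthesis); the class name GreedyDemo and the helper groups are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: collects group 1 of every match of regex in input.
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "[bloco1] [bloco2]";
        // Assumed shape of the original pattern: only ")" is excluded,
        // so the greedy + runs across "] [" and yields a single match
        // whose group 1 is: bloco1] [bloco2
        System.out.println(groups("\\[([^)]+)\\]", s));
        // Brackets excluded from the class: two matches, bloco1 and bloco2.
        System.out.println(groups("\\[([^\\[\\]]+)\\]", s));
    }
}
```

Excluding the brackets from the class means the + can never step over a closing bracket, so each block is matched on its own.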
If you want the results to be [José Cavalo], [bloco 1] and [bloco 2] (with the brackets), simply replace m.group(1) with m.group(0) (or just m.group(), with no arguments; both return the entire match).
You could also use "\\[([^\\[\\]()]+)\\]" (adding ( and ) to the character class), so that the regex matches neither brackets nor parentheses inside the block.
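As a rough illustration of that variant, the sketch below runs it on a made-up input in which one block contains parentheses. The class name ParenDemo, the helper blocks and the sample string are assumptions for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenDemo {
    // Same pattern as in the text: brackets AND parentheses are excluded.
    static final Pattern NO_PARENS = Pattern.compile("\\[([^\\[\\]()]+)\\]");

    // Hypothetical helper: returns the inner text of every matching block.
    static List<String> blocks(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = NO_PARENS.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The middle block contains "(" and ")", so it cannot match at all.
        System.out.println(blocks("[a] [b(c)] [d]")); // prints [a, d]
    }
}
```

Note that a block containing parentheses is skipped entirely rather than partially matched, because the class must cover everything between [ and ].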
Now, to match only Olá (without the comma), you can use the shorthand \w:
String s = "Olá, o meu nome é [José Cavalo]";
Matcher matcher = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
I also used the Pattern.UNICODE_CHARACTER_CLASS option so that \w also matches accented characters.
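The effect of the flag is easiest to see by running the same pattern with and without it. This is a small sketch; the class name UnicodeWordDemo and the helper firstWord are hypothetical, and the accented letter is written as a Unicode escape (precomposed á, U+00E1) so the example does not depend on the source file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Hypothetical helper: returns the first \w+ match, or null if none.
    static String firstWord(String input, int flags) {
        Matcher m = Pattern.compile("\\w+", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Ol\u00e1, o meu nome"; // "Olá, o meu nome"
        // Without the flag, \w is [a-zA-Z_0-9], so the match stops before á.
        System.out.println(firstWord(s, 0)); // prints Ol
        // With the flag, \w follows the Unicode definition of a word character.
        System.out.println(firstWord(s, Pattern.UNICODE_CHARACTER_CLASS)); // prints Olá
    }
}
```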
To match both the words without the comma and the bracketed blocks, use alternation (the | character, which means "or"):
String s = "Olá, o meu nome é [José Cavalo] [bloco 1] [bloco 2]";
Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
This way, the regex first tries to match the characters between brackets. If that fails, it tries to match a sequence of \w.
The output is:
Olá
o
meu
nome
é
[José Cavalo]
[bloco 1]
[bloco 2]
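If you would rather get the bracketed blocks without their brackets while still using the alternation, one possible approach (a sketch, not the only way) is to check whether group 1 took part in the match: it is null whenever the \w+ branch matched. The class name AlternationDemo and the helper tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    static final Pattern TOKEN = Pattern.compile(
            "\\[([^\\[\\]]+)\\]|\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: bracketed blocks are returned without brackets,
    // plain words are returned as they are.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String inner = m.group(1); // null when the \w+ branch matched
            out.add(inner != null ? inner : m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "Olá, o [José Cavalo] [bloco 1]" written with Unicode escapes.
        System.out.println(tokens("Ol\u00e1, o [Jos\u00e9 Cavalo] [bloco 1]"));
    }
}
```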
One detail is that \w also matches digits and the _ character. If you want to restrict the match to letters only, another option is to use Unicode properties:
Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\p{L}+").matcher(s);
Here, \p{L} matches every letter defined by Unicode (all general categories starting with "L"). That includes other writing systems, such as Japanese, Arabic, Cyrillic, etc. If you want to restrict the match to the Latin alphabet, another option is \p{Script=Latin}:
Matcher matcher = Pattern.compile("\\[([^\\[\\]]+)\\]|\\p{Script=Latin}+").matcher(s);
In both cases, you do not need the Pattern.UNICODE_CHARACTER_CLASS option.
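The difference between the three options shows up on an input that mixes scripts, digits and an underscore. The sketch below uses a made-up sample string (with a Cyrillic word) purely for illustration; the class name LetterClassDemo and the helper matches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterClassDemo {
    // Hypothetical helper: returns every full match of the pattern in input.
    static List<String> matches(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "Olá мир bloco_1": Latin word, Cyrillic word, word with _ and a digit.
        String s = "Ol\u00e1 \u043c\u0438\u0440 bloco_1";
        // \w with the Unicode flag keeps the underscore and the digit.
        System.out.println(matches(
                Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS), s));
        // \p{L} keeps letters of any script, but drops "_" and "1".
        System.out.println(matches(Pattern.compile("\\p{L}+"), s));
        // \p{Script=Latin} additionally drops the Cyrillic word.
        System.out.println(matches(Pattern.compile("\\p{Script=Latin}+"), s));
    }
}
```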