Solution with \W
In Java regular expressions, as described in the documentation of the Pattern class, there is a character class \w (lowercase) that represents the characters that form a word. It is equivalent to [a-zA-Z_0-9].
There is also the character class \W (uppercase), which represents the opposite of the previous one, i.e., characters that do not form words.
A simplistic solution would be to use \W to split the string on any non-word character, which includes punctuation and spaces.
But there are problems with this approach:
- It does not account for special characters that are commonly part of words, such as the hyphen.
- It does not account for accented characters, since they are not part of the \w set.
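To make the two problems concrete, here is a minimal sketch. The class name NonWordSplitDemo, the helper splitOnNonWord, and the sample words are illustrative assumptions, not part of the original answer. It relies on Java's default behavior, where \w covers only ASCII letters, digits, and the underscore.

```java
import java.util.Arrays;

public class NonWordSplitDemo {
    // Hypothetical helper: splits on every single character outside [a-zA-Z_0-9].
    static String[] splitOnNonWord(String text) {
        return text.split("\\W");
    }

    public static void main(String[] args) {
        // The hyphen counts as a separator, so the compound word is broken in two.
        System.out.println(Arrays.toString(splitOnNonWord("guarda-chuva")));
        // prints [guarda, chuva]

        // \u00e7 (c with cedilla) and \u00e3 (a with tilde) are not in \w,
        // so the word is cut at each of them, leaving an empty string in between.
        System.out.println(Arrays.toString(splitOnNonWord("a\u00e7\u00e3o")));
        // prints [a, , o]
    }
}
```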
Specific solution
A more specific solution would be to define a set of characters that must "break" the String. Example:
String caracteres = " #@_\\/.*";
Then you place these characters between square brackets, which in regular expressions defines a custom character class. Example:
String words[] = line.split("[" + Pattern.quote(caracteres) + "]");
The Pattern.quote method above ensures that the characters are escaped as needed, so that none of them breaks the regular expression.
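As a rough illustration of what Pattern.quote contributes, the sketch below prints the quoted form of the delimiter list and then uses a delimiter set that would normally be troublesome inside brackets. The class QuoteDemo, the helper delimiterClass, and the input "a^b]c" are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: builds a character class from a literal list of delimiters.
    static String delimiterClass(String delimiters) {
        return "[" + Pattern.quote(delimiters) + "]";
    }

    public static void main(String[] args) {
        // Pattern.quote wraps the text in \Q...\E, which marks it as literal.
        System.out.println(Pattern.quote(" #@_\\/.*"));
        // prints \Q #@_\/.*\E

        // '^' and ']' have special meaning inside brackets; once quoted,
        // they are treated as plain delimiter characters.
        System.out.println(Arrays.toString("a^b]c".split(delimiterClass("^]"))));
        // prints [a, b, c]
    }
}
```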
Full example
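// Requires: import java.util.regex.Pattern;
String line = "1 2#3@4_5/6.7*8";
// Characters that should break the string (the backslash is escaped for Java)
String caracteres = " #@_\\/.*";
// Wrap the quoted characters in brackets to form a custom character class
String[] words = line.split("[" + Pattern.quote(caracteres) + "]");
for (String word : words) {
    System.out.print(word + " ");
}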
Output:
1 2 3 4 5 6 7 8
Special characters in sequence
With the expression above, empty strings can appear in the array if two special characters or spaces occur in sequence. This is common when a sentence contains a period or comma followed by a space.
To prevent this, just add a + right after the custom class so that split consumes each run of special characters as a single block. Example:
String words[] = line.split("[" + Pattern.quote(caracteres) + "]+");
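The difference is easier to see side by side. The following is only a sketch: the class RunSplitDemo, its two helpers, the delimiter list " .,", and the sample sentence are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class RunSplitDemo {
    // Assumed delimiter list for this example: space, period, and comma.
    static final String DELIMITERS = " .,";

    // Splits on each delimiter individually.
    static String[] splitSingle(String line) {
        return line.split("[" + Pattern.quote(DELIMITERS) + "]");
    }

    // Splits on each run of one or more consecutive delimiters.
    static String[] splitRuns(String line) {
        return line.split("[" + Pattern.quote(DELIMITERS) + "]+");
    }

    public static void main(String[] args) {
        String line = "Hello, world. Bye";
        // ", " and ". " are two delimiters each, so empty strings appear between them.
        System.out.println(Arrays.toString(splitSingle(line)));
        // prints [Hello, , world, , Bye]

        // With +, each run is consumed as one separator.
        System.out.println(Arrays.toString(splitRuns(line)));
        // prints [Hello, world, Bye]
    }
}
```

Note that a line that starts with a delimiter still yields one leading empty string even with +, so it may be worth trimming the input or skipping empty entries.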
It worked! Could you explain better what [\\W] means there? – Renan Gomes
@mxn In regular expressions, \W means any character other than a letter, a digit, or an underscore. – utluiz
Thanks, utluiz, I was looking for this information to add to the answer. – jsantos1991
@mxn Note that I edited the regular expression to [\\W][ ] so that it removes the spaces too, since [\\W] was keeping the spaces... – jsantos1991
The edit you made stopped it from working: a.split("[\\W][ ]"); doesn't work. a.split("[\\W]"); works. The result of a.split("[\\W][ ]"); is the same as the one posted by @bigown. Can you edit it back to how it was? – Renan Gomes
I ran this code to check that it was correct (that is how I noticed the spaces), but curiously it worked fine for me. – jsantos1991
In my case it didn't work. If possible, leave both implementations then, because only the first one worked correctly here. – Renan Gomes
It is very strange, but yes, I will make that edit... – jsantos1991