How to use more than one separation character in the split() method?


15

I'd like to break a String into several substrings, and for that I am using the split() method. The problem is that I'm not sure which characters might appear in the variable I'm splitting.

For example:

String words[] = line.split(" ");

This code does what I need as long as only " " is used to separate the words. The problem is that the input will be read from a text file, where the user can put any character between words.

So I would need to create something like:

String words[] = line.split(" #@_\\/.*");

Is it possible to do this in Java? Any solution?

3 answers

15


One possibility is:

    String a = "Exemplo, de. separar- string+ por* carater";
    // Since you want to split on any such character, you can use this regular expression:
    String[] parts = a.split("[\\W]");

    for (String i : parts) {
        System.out.println("===" + i);
    }

Output:

===Exemplo
===
===de
===
===separar
===
===string
===
===por
===
===carater

To get rid of the empty strings, which come from the space that follows each punctuation mark, change this line of code to:

String[] parts = a.split("[\\W][ ]");

Output:

===Exemplo
===de
===separar
===string
===por
===carater
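
The [\\W][ ] pattern only matches a non-word character immediately followed by a space, so it depends on the input having exactly that shape. A possible alternative, offered as a sketch and not as part of the original answer, is \\W+, which treats any run of non-word characters as a single separator. The class name SplitRuns, the method splitOnRuns and the second sample string are made up for illustration.

```java
import java.util.Arrays;

class SplitRuns {
    // Splits on one or more consecutive non-word characters.
    static String[] splitOnRuns(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String a = "Exemplo, de. separar- string+ por* carater";
        System.out.println(Arrays.toString(splitOnRuns(a)));
        // prints [Exemplo, de, separar, string, por, carater]

        // Also works when words are separated by a single space or a single comma.
        System.out.println(Arrays.toString(splitOnRuns("one two,three")));
        // prints [one, two, three]
    }
}
```

Note that if the text starts with a separator, split still returns an empty string as the first element.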
  • It worked! Could you explain what the [\\W] means there?

  • 2

    @mxn In regular expressions, \W means any character other than a letter, a digit or an underscore.

  • Thanks, utluiz, I was looking for this information to add to the answer.

  • @mxn note that I edited the regular expression to [\\W][ ] so that it removes the spaces too; as it was, [\\W] kept the spaces as well...

  • The edit you made stopped working for me: a.split("[\\W][ ]"); doesn't work, while a.split("[\\W]"); does. The result of a.split("[\\W][ ]"); is the same as the one posted by @bigown. Could you edit the answer back to how it was?

  • I ran this code to check that it was correct (that is how I noticed the spaces), but curiously it worked fine for me.

  • In my case it didn't work. If possible, leave both implementations then, because only the first one worked here.

  • It is very strange, but yes, I will make that edit...


14

Solution with \W

In Java's regular expressions, as described in the documentation of the Pattern class, there is a character class \w (lowercase) that matches the characters that make up a word. It is equivalent to [a-zA-Z_0-9].

There is also the character class \W (uppercase), which is the opposite of the previous one: it matches characters that are not part of a word.

A simplistic solution would be to use \W to split the string on any non-word character, including punctuation and spaces.

But there are problems with this approach:

  • It treats as separators special characters that are commonly part of words, such as the hyphen.
  • It treats accented characters as separators too, since they are not in the \w set.
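
Both problems can be seen in a short sketch. The second method is only one possible workaround, assuming that a "word" means Unicode letters, digits and hyphens; the class and method names are invented for the example, and the sample text "guarda-chuva é útil" is written with Unicode escapes to avoid source-encoding issues.

```java
import java.util.Arrays;

class WordSplitDemo {
    // Plain \W: only ASCII letters, digits and underscore count as word characters.
    static String[] splitAscii(String text) {
        return text.split("\\W+");
    }

    // Assumption: any Unicode letter (\p{L}), any number (\p{N}) and the hyphen belong to a word.
    static String[] splitUnicode(String text) {
        return text.split("[^\\p{L}\\p{N}\\-]+");
    }

    public static void main(String[] args) {
        String s = "guarda-chuva \u00e9 \u00fatil"; // "guarda-chuva é útil"

        // The hyphen and the accented letters are consumed as separators.
        System.out.println(Arrays.toString(splitAscii(s))); // prints [guarda, chuva, til]

        // The hyphenated word and the accented words survive intact.
        System.out.println(Arrays.toString(splitUnicode(s)));
    }
}
```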

Specific solution

A more specific solution would be to define a set of characters that must "break" the String. Example:

String caracteres = " #@_\\/.*";

Then you place these characters between square brackets, which in regular expressions defines a custom character class. Example:

String words[] = line.split("[" + Pattern.quote(caracteres) + "]");

The Pattern.quote method above ensures that the characters are escaped as needed, so that they are treated literally and do not break the regular expression.
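
To illustrate why the quoting matters, here is a small sketch with a hypothetical delimiter set, "^-]", in which every character has a special meaning inside square brackets. The class name QuoteDemo and the helper splitOnAny are made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDemo {
    // Builds a character class from a string of literal delimiter characters.
    static String[] splitOnAny(String text, String delimiters) {
        return text.split("[" + Pattern.quote(delimiters) + "]");
    }

    public static void main(String[] args) {
        // Hypothetical delimiters: ^ negates a class, - forms a range, ] closes the class.
        String delimiters = "^-]";

        // Without quoting, the pattern would be "[^-]]", which means something else entirely:
        // "any character except '-', followed by a literal ']'".
        System.out.println(Arrays.toString(splitOnAny("a^b-c]d", delimiters)));
        // prints [a, b, c, d]
    }
}
```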

Full example

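// requires: import java.util.regex.Pattern;
String line = "1 2#3@4_5/6.7*8";
String caracteres = " #@_\\/.*";
// Wrap the quoted characters in [...] to build a character class
String[] words = line.split("[" + Pattern.quote(caracteres) + "]");
for (String word : words) {
    System.out.print(word + " ");
}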

Output:

1 2 3 4 5 6 7 8

Special characters in sequence

With the above expression, empty strings can end up in the array if two special characters or spaces appear in sequence. This is common in a sentence where a period or comma is followed by a space.

To prevent this, just add a + right after the custom class, so that split consumes a whole run of special characters as a single separator. Example:

String words[] = line.split("[" + Pattern.quote(caracteres) + "]+");
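
The difference is easy to see side by side. The following is a minimal sketch; the sample line and the class and method names are invented for illustration, while the delimiter string is the one used above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SeparatorRunsDemo {
    static final String CARACTERES = " #@_\\/.*";

    // Each delimiter character is its own separator.
    static String[] splitSingle(String line) {
        return line.split("[" + Pattern.quote(CARACTERES) + "]");
    }

    // A run of delimiter characters counts as one separator.
    static String[] splitRuns(String line) {
        return line.split("[" + Pattern.quote(CARACTERES) + "]+");
    }

    public static void main(String[] args) {
        String line = "one. two@#three";
        System.out.println(Arrays.toString(splitSingle(line))); // prints [one, , two, , three]
        System.out.println(Arrays.toString(splitRuns(line)));   // prints [one, two, three]
    }
}
```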

6

I think this solves your problem:

class Test {
    public static void main(String[] args) {
        String line = "banana*batata.pepino#alface_tomate@cenoura cebola/abacate|morango\\laranja";
        // Alternation: split on a space, #, @, _, \, /, . or *
        for (String retval : line.split(" |#|@|_|\\\\|\\/|\\.|\\*")) {
            System.out.println(retval);
        }
    }
}

See it working on ideone and on repl.it. I also put it on GitHub for future reference.

I'm using the regex OR operator (|), since split() takes a regular expression.
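
For single-character separators like these, the alternation can also be written as a character class. The sketch below compares the two forms on the same input; the class and method names are hypothetical, and the character-class version is an alternative spelling and not part of the original answer.

```java
import java.util.Arrays;

class AlternationVsClass {
    // The pattern from the answer: one alternative per separator.
    static String[] splitWithAlternation(String line) {
        return line.split(" |#|@|_|\\\\|\\/|\\.|\\*");
    }

    // Same separators as a character class; only the backslash needs escaping inside [...].
    static String[] splitWithClass(String line) {
        return line.split("[ #@_\\\\/.*]");
    }

    public static void main(String[] args) {
        String line = "banana*batata.pepino#alface_tomate@cenoura cebola/abacate|morango\\laranja";
        String[] a = splitWithAlternation(line);
        String[] b = splitWithClass(line);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```

Since | is not among the separators, "abacate|morango" stays together as a single element in both cases.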

  • I don't know if the regex is right, because I don't usually use regex, but this is basically it.

  • It didn't work for me. It produced an empty String; I don't know what it split on, but it wasn't those characters, hehe. Still, thank you.

  • I had a small problem, but I have tested it now and it worked for what I tested.
