Manipulating a String in Java

Viewed 7,414 times

9

I have some text in a String and need to walk through it, picking out every word it contains. I thought of using string.split(" "), but I also need to handle ".", ";", ",", ":", "!", "?" and other such cases. How can I do that?

  • Do you want to split on these characters, or remove them from the sentence?

  • @Fernando I need to take the text word by word, so that the last character of each substring is the last letter of the word.

2 answers

8


You can use Regex. Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TesteRegex {
    public static void main(String[] args) {
        String frase = "Várias palavras em uma só String.\n"
                + "Ignorando pontos; Ponto-e-vírgula; Traços. E números 0132.";
        // One or more consecutive letters, including accented ones from à to ú
        Pattern p = Pattern.compile("[a-zA-Zà-úÀ-Ú]+");
        Matcher m = p.matcher(frase);
        int i = 1;
        // Each call to find() advances to the next word in the text
        while (m.find()) {
            System.out.println("Palavra " + i + ": " + m.group());
            i++;
        }
        System.out.println("Frase completa: " + frase);
    }
}

Output:

Palavra 1: Várias
Palavra 2: palavras
Palavra 3: em
Palavra 4: uma
Palavra 5: só
Palavra 6: String
Palavra 7: Ignorando
Palavra 8: pontos
Palavra 9: Ponto
Palavra 10: e
Palavra 11: vírgula
Palavra 12: Traços
Palavra 13: E
Palavra 14: números
Frase completa: Várias palavras em uma só String.
Ignorando pontos; Ponto-e-vírgula; Traços. E números 0132.

The Pattern I used, [a-zA-Zà-úÀ-Ú]+, matches every character from a to z and from à to ú, in both lower and upper case. The + sign means one or more consecutive characters, so whole words are matched instead of single letters.

Everything else is therefore ignored, including spaces, special characters and digits, as the example above shows.

Looking at the Unicode character list, we can see that the range from à to ú includes some characters that may be considered undesirable, such as æ, å, ÷ and ø. Here are the complete ranges:

From À to Ú: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú

From à to ú: à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú

If you are reading files from various sources, you may run into these characters occasionally. If you are reading from a text field that the user types into, I would say it is unnecessary to exclude them, because hardly anyone will type an Å in the middle of a text; our keyboards don't even have a key for it (I had to copy and paste it myself).

But if you prefer, you can use a more specific Pattern that accepts only the accented characters used in our alphabet: [a-zA-ZàáâãçèéêìíòóõùúÀÁÂÃÇÈÉÊÌÍÎÒÓÔÕÙÚ]+

Note that the - sign denotes a range, so à-ú accepts everything from à to ú. In the Pattern above I did not use a range for the accented characters; I listed one by one the characters that should be accepted. For the unaccented letters I kept a-zA-Z, because there are no unwanted characters among them.
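As a quick check of which characters in the two accented ranges are not letters at all, the sketch below walks the code points from À (U+00C0) to Ú (U+00DA) and from à (U+00E0) to ú (U+00FA) and reports the ones that Character.isLetter rejects. The class and method names (RangeCheck, nonLetters) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeCheck {
    // Hypothetical helper: lists the code points in [from, to]
    // that Character.isLetter does not consider letters.
    static List<String> nonLetters(int from, int to) {
        List<String> found = new ArrayList<>();
        for (int cp = from; cp <= to; cp++) {
            if (!Character.isLetter(cp)) {
                found.add(String.format("U+%04X", cp));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(nonLetters(0x00C0, 0x00DA)); // uppercase range, prints [U+00D7]
        System.out.println(nonLetters(0x00E0, 0x00FA)); // lowercase range, prints [U+00F7]
    }
}
```

U+00D7 is the multiplication sign × and U+00F7 is the division sign ÷. The other characters mentioned above, such as æ, å and ø, are letters, so whether to accept them depends on what your input is expected to contain.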

  • With this pattern, is the ç included?

  • @Peace Yes! à is the first of the accented characters in the range and ú is the last, which means ç is included.

  • I added the complete list of characters from à to ú and created a Pattern that excludes some characters that may be undesirable in your application. Since I don't know where you are reading these characters from, it might be a good idea to use it, although à-ú already works very well in most situations.

  • It worked. I switched to the Pattern you updated. Once again, thank you very much! I think this is the thousandth time you've saved me :D

  • I think this regex could be replaced by "[\\p{L}]+", couldn't it?

  • @Rodrigosasaki First of all, I have to say I needed to do some research to answer you. According to this site, the Pattern [\\p{L}]+ accepts all Unicode letters. I'm not sure exactly which letters those are, but I know this Pattern is much broader than the two I included in my answer. I ran some tests and found that it accepted numerous other characters. Therefore, I believe it is better to write a longer Pattern that guarantees the OP's program works properly.

  • @Math understood. It’s really more comprehensive than I imagined. Great answer and solution!

  • @Math I need to recognize when there is a \n in the string. How can I do this? Do I use the Pattern for that as well?

  • @Peace It depends. If I understood correctly, you can put a \n inside the [ and ], for example right after Ú. See if that's what you need; if not, explain what happened and how you need it to behave.

  • @Math When I put \n after the Ú, it fails at runtime with java.lang.ArrayIndexOutOfBoundsException: 65514

  • Are you sure this is what causes the exception? I ran some tests here and it worked normally. Check the full stack trace to see what caused it, because this sounds strange. If that really is the cause, explain your program's use case a little better: from what I understand you want to accept \n in your input text, but that has an effect I'm not sure is desirable, since you will be storing a \n in your String when you could simply discard it.

  • @Math It worked o/

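For comparison, here is a minimal sketch of the \p{L} variant suggested in the comments above. The class name UnicodeLetters and the sample text are assumptions for illustration; the non-ASCII characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeLetters {
    // \p{L} matches any code point in the Unicode "Letter" general category.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: collects every run of letters found in the text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00E9 = e with acute, U+00F7 = division sign, U+03A9 = Greek capital omega
        String text = "caf\u00E9 \u00F7 \u03A9mega 0132";
        System.out.println(words(text));
    }
}
```

For the sample text the method returns two tokens, café and Ωmega. The Greek letter is accepted because \p{L} covers letters from every script, while the division sign ÷ and the digits are skipped because they are not letters.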

3

You can use a regular expression for this purpose.

Split

Java's String.split method accepts a regular expression. See the documentation.

Something like this: [.;,:!?] (a character class listing the characters you want to split on).

This splits the text at each of the specified characters and returns an array.

In Java it would be something like this:

String str = "Eu sei? que nada, sei, mais uns .'s e umas ,'s";
// Split wherever one of the listed punctuation characters occurs
String[] result = str.split("[.;,:!?]");
for (String r : result) {
    System.out.println(r);
}

The output would be:

Eu sei
 que nada
 sei
 mais uns 
's e umas 
's
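If the goal is to get individual words rather than fragments with leading spaces, one option is to add whitespace to the character class and split on runs of delimiters. This is a sketch under that assumption, not part of the original answer; SplitWords and words are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitWords {
    // Hypothetical helper: splits on runs of whitespace and the listed punctuation.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("[\\s.;,:!?]+")) {
            // split yields a leading empty string when the text starts with a delimiter
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "Eu sei? que nada, sei, mais uns .'s e umas ,'s";
        // prints [Eu, sei, que, nada, sei, mais, uns, 's, e, umas, 's]
        System.out.println(words(str));
    }
}
```

The + after the character class treats a sequence such as "? " as a single delimiter, which avoids empty strings between words.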

Replace

You can also replace the undesirable characters. Java's String.replaceAll method accepts a regular expression as well; see the documentation.

It would be something like this:

String result2 = str.replaceAll("[.;,:!?]", "");
System.out.println(result2);

The output would be:

Eu sei que nada sei mais uns 's e umas 's

From what I understand, this is what you are looking for, right?

See if it works for you.

  • I can't replace them, especially the sentence-ending marks '!', '.' and '?', because later I will need to tell the different sentences apart. I can't change the text.

  • @Peace I don't understand. You don't want to remove the '!', '.' and '?', but you do want to split on them? Please explain it better, with examples if possible.
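One way to keep sentence boundaries without changing the text is to match tokens instead of splitting. The sketch below matches either a word or one of the sentence-ending marks and starts a new group each time a mark is found. It reuses the à-ú and À-Ú ranges from the first answer, written as Unicode escapes; SentenceWords and wordsBySentence are hypothetical names, and treating every '.', '!' or '?' as a sentence end is a simplifying assumption (abbreviations and decimal numbers would be split too).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceWords {
    // A word (a-z, A-Z, U+00E0..U+00FA, U+00C0..U+00DA) or a sentence-ending mark.
    private static final Pattern TOKEN =
            Pattern.compile("[a-zA-Z\u00E0-\u00FA\u00C0-\u00DA]+|[.!?]");

    // Hypothetical helper: groups the words of the text by sentence,
    // leaving the original String untouched.
    static List<List<String>> wordsBySentence(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                // A terminator closes the current sentence, if it has any words
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        // Keep trailing words that were not followed by a terminator
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints [[Eu, sei], [Que, nada, sei], [Mais, umas, palavras]]
        System.out.println(wordsBySentence("Eu sei? Que nada, sei! Mais umas palavras."));
    }
}
```

Because the Matcher only reads the input, the original text stays available for whatever later processing needs the punctuation.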
