How to extract sentences from a text in Java?

I recently read an article that analyzed the length of the sentences used by several authors, as a stylistic study of their works.

How do I read a text (with several paragraphs) and extract its sentences? Preferably in Java.

1 answer

Starting with the simplest case, assume that a sentence ends with a period followed by a space (or line break):

She’s from Rio. He, from São Paulo.

All it takes is a single split() on the string, using a period followed by any whitespace as the delimiter, and remembering to escape the characters with \:

s.split("\\.\\s+");
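
To make the behaviour concrete, here is a minimal sketch of that split applied to the example above. The class and method names are made up for illustration. Note that split() consumes the delimiter, so the first sentence loses its period while the last one keeps it (nothing follows it to match \s+).

```java
import java.util.Arrays;

final class PeriodSplitDemo {
    // Splits on a period followed by one or more whitespace characters.
    // The matched period and whitespace are removed from the result.
    static String[] splitOnPeriod(String text) {
        return text.split("\\.\\s+");
    }

    public static void main(String[] args) {
        // \u00e3 is the "a with tilde" in "Sao Paulo", escaped to keep the source ASCII.
        String text = "She's from Rio. He, from S\u00e3o Paulo.";
        System.out.println(Arrays.toString(splitOnPeriod(text)));
    }
}
```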

But we must also consider exclamation points and question marks:

Where have you been? I was worried!

For this we use a positive lookbehind, which checks for the punctuation without consuming it:

s.split("(?<=[.!?])\\s+");
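
A small sketch of the lookbehind version, again with hypothetical class and method names. Because the lookbehind is zero-width, only the whitespace is consumed and every sentence keeps its final punctuation.

```java
import java.util.Arrays;

final class TerminatorSplitDemo {
    // Splits on whitespace that is immediately preceded by '.', '!' or '?'.
    static String[] splitSentences(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = "Where have you been? I was worried!";
        System.out.println(Arrays.toString(splitSentences(text)));
    }
}
```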

But we also have to consider that some sentences may be enclosed in single or double quotes, as in dialogue:

"Am I old today?" - said my father.

To handle this, we add the quote characters to the pattern. Note that the dash is just another character, which can be removed or kept as the programmer prefers:

s.split("(?<=[.!?]|[.!?][\'\"])\\s+");
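
The difference from the previous pattern is easiest to see on two quoted sentences in a row. The sketch below compares both patterns on a made-up input (the input and the names are assumptions, not part of the original example). The plain pattern finds no boundary, because the whitespace follows a quote rather than the punctuation itself.

```java
import java.util.Arrays;

final class QuoteSplitDemo {
    // Boundary only after '.', '!' or '?'.
    static String[] splitPlain(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    // Boundary after '.', '!' or '?', optionally followed by a closing quote.
    static String[] splitQuoteAware(String text) {
        return text.split("(?<=[.!?]|[.!?]['\"])\\s+");
    }

    public static void main(String[] args) {
        String text = "\"Where have you been?\" \"I was worried!\"";
        System.out.println(Arrays.toString(splitPlain(text)));
        System.out.println(Arrays.toString(splitQuoteAware(text)));
    }
}
```

Be aware that with this pattern a line such as "Am I old today?" - said my father. is also cut after the closing quote, so the attribution becomes a fragment of its own.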

But we still have abbreviations. What do we do when a period followed by a space marks an abbreviation (Mr., Mrs., Dr., etc.) rather than the end of a sentence?

Mrs. Pereira met George W. Bush.

Here we use a negative lookbehind:

String pattern = "(?<=[.!?]|[.!?][\'\"])(?<!Mr\\.|Mrs\\.|Dr\\.|Sr\\.|Sra\\.|Dra\\.|W\\.)\\s+";
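
As a rough check of the idea, here is a small sketch that applies a negative lookbehind to the example sentence. The abbreviation list (Mr., Mrs., Dr., W.) and the class and method names are assumptions chosen for the English example. One thing to watch: the lookbehind must not contain an empty alternative (a trailing |), since an empty alternative matches everywhere and would make the negative lookbehind reject every split point.

```java
import java.util.Arrays;

final class AbbreviationRegexDemo {
    // Positive lookbehind: sentence punctuation, optionally followed by a quote.
    // Negative lookbehind: the text just before the whitespace must not be
    // one of the listed abbreviations (an illustrative, incomplete list).
    private static final String PATTERN =
            "(?<=[.!?]|[.!?]['\"])(?<!Mr\\.|Mrs\\.|Dr\\.|W\\.)\\s+";

    static String[] splitSentences(String text) {
        return text.split(PATTERN);
    }

    public static void main(String[] args) {
        String text = "Mrs. Pereira met George W. Bush. They talked for an hour.";
        System.out.println(Arrays.toString(splitSentences(text)));
    }
}
```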

Note that the regex is already getting complicated. Ideally, the abbreviations go in a separate data structure and are checked one by one. More complex cases (e.g., U.K.) may also come up and need handling.
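
One possible way to do that, sketched under a few assumptions: the regex only finds candidate boundaries, and a Set of abbreviations (a tiny, hypothetical list here) decides whether the token just before each candidate really ends a sentence. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AbbreviationAwareSplitter {
    // Hypothetical, deliberately short list; a real one would be much longer.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Dr.", "W.", "U.K."));

    // Candidate boundary: whitespace after '.', '!' or '?' (and optional quote).
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<=[.!?]|[.!?]['\"])\\s+");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        int start = 0;
        while (m.find()) {
            String before = text.substring(start, m.start());
            // Last whitespace-delimited token before the candidate boundary.
            String[] tokens = before.split("\\s+");
            String lastToken = tokens[tokens.length - 1];
            if (ABBREVIATIONS.contains(lastToken)) {
                continue; // abbreviation, so keep scanning for the real end
            }
            sentences.add(before);
            start = m.end();
        }
        if (start < text.length()) {
            sentences.add(text.substring(start)); // remaining text
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mrs. Pereira met George W. Bush in the U.K. last year. They talked.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

This keeps the regex short and lets the list grow without touching it. It still fails when a sentence really does end in an abbreviation (for example "She moved to the U.K. He stayed."), which is the kind of case that needs extra rules.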

In short, your code can get quite sophisticated, but keep in mind that this is natural language processing, and there is no perfect solution yet. The best algorithms reach 90% to 99% accuracy, depending on the text.

If you need a more robust and accurate solution, I suggest looking into the Stanford NLP Parser, which provides Java implementations of algorithms for this.
