Separating a paragraph by sentences

I need to break a paragraph into a set of sentences.

For example:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.split('.');

But this also splits at the period in Dr.:

var array = [
  "Sou Dr",
  " José",
  " Meu passatempo é assistir séries",
  " Adoro animais!! E você?"
];

What I expect to get back is:

var array = [
  "Sou Dr. José.",
  "Meu passatempo é assistir séries.",
  "Adoro animais!!",
  "E você?"
];
  • Well, you need a defined criterion before any code can do this. For your example split would be perfect, but the code cannot "know" that the period after "Dr" should not trigger a split. So it gets complicated, since splitting on periods alone does not solve your case.

  • What do you want this for? You may be applying the wrong solution to your problem.

  • I see. I researched a little and saw that I can do this with regex. Do you know a way to add exceptions, such as Dr.?

  • In general terms, I need to take paragraphs from a PDF (this part I already have working), split them into sentences, and then check a database for a record of each sentence.

  • This is probably just an example, and in the real case you will have more exceptions. You can normalize the exceptions before performing the split, for instance by replacing Dr. with Doctor before splitting. You will also have problems with ... inside sentences, which is another exception, and so on. So add a normalization step with replace before the split.

  • Yes, I think I can use this. The problem is that I will have MANY exceptions, because the PDFs will come in several languages... But I think this is the only way out.
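
The normalization idea from these comments can be sketched as follows. This is only a minimal sketch, not the commenters' code: the abbreviation list, the placeholder character, and the function name splitAfterNormalizing are assumptions made for illustration. Instead of replacing Dr. with Doctor, it hides the period of each known abbreviation behind a placeholder, so the original wording can be restored before the sentences are compared with database records.

```javascript
// Hypothetical list of abbreviations whose period must not end a sentence.
const ABBREVIATIONS = ["Dr.", "Dra.", "Srs.", "Sras."];
// Assumption: this character never occurs in the input text.
const PLACEHOLDER = "\u0000";

function splitAfterNormalizing(paragraph) {
  let protectedText = paragraph;
  for (const abbreviation of ABBREVIATIONS) {
    // "Dr." becomes "Dr" + placeholder, so its period is hidden from the split.
    const hidden = abbreviation.slice(0, -1) + PLACEHOLDER;
    protectedText = protectedText.split(abbreviation).join(hidden);
  }
  // A sentence is a run of non-terminators followed by one or more terminators.
  const pieces = protectedText.match(/[^.!?]+[.!?]+/g) || [];
  // Restore the hidden periods and drop the surrounding whitespace.
  return pieces.map((piece) => piece.split(PLACEHOLDER).join(".").trim());
}

console.log(
  splitAfterNormalizing(
    "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?"
  )
);
```

Note that the abbreviations are matched case-sensitively and without a word-boundary check, and that trailing text without a final terminator is dropped by this sketch.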

1 answer


Solution (ECMAScript 2018 / ES9):

.*?[.!?](?![.!?])(?<!\b\w\w\.)

Demonstration:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(frases);

Explanation:

First, a warning: this regex only skips abbreviations of exactly two letters (e.g. "Dr.", "Mr.", "Fr.").

  • .*?[.!?] - This captures any piece of text that ends with a period, exclamation mark, or question mark. A lazy quantifier is used so that each piece is captured separately.
  • (?![.!?]) - This is a negative lookahead. It rejects a match that is immediately followed by another of these punctuation marks, so that repeated punctuation is captured as a whole, as in Adoro animais!!.
  • (?<!\b\w\w\.) - This is a negative lookbehind. It rejects a match that ends with a \b (a word boundary), two characters of type \w (in JavaScript, the same as [a-zA-Z0-9_]), and a period. This keeps text like Dr. José within the same sentence, while something like Dra. Maria is still split.
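
To see what each of the three parts contributes, the pattern can be built up one piece at a time. This is a small illustrative sketch; the sample string and the variable names are not part of the original answer.

```javascript
const sample = "Sou Dr. José. Adoro animais!! E você?";

// Step 1: lazy match up to the nearest terminator.
// Splits after "Dr." and also between the two exclamation marks.
const lazyOnly = sample.match(/.*?[.!?]/g);

// Step 2: add the negative lookahead.
// Repeated punctuation now stays together, but "Dr." still ends a match.
const withLookahead = sample.match(/.*?[.!?](?![.!?])/g);

// Step 3: add the negative lookbehind.
// A period right after a two-letter word no longer ends a match.
const withLookbehind = sample.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);

console.log(lazyOnly);       // [ 'Sou Dr.', ' José.', ' Adoro animais!', '!', ' E você?' ]
console.log(withLookahead);  // [ 'Sou Dr.', ' José.', ' Adoro animais!!', ' E você?' ]
console.log(withLookbehind); // [ 'Sou Dr. José.', ' Adoro animais!!', ' E você?' ]
```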

That is the idea behind this regular expression. To improve it, for example by removing the space left at the beginning of each piece, we can add a negative lookahead at the start so that a match never begins with a space:

(?! ).*?[.!?](?![.!?])(?<!\b\w\w\.)
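
Text extracted from a PDF often contains line breaks and tabs, not just single spaces. That is an assumption about the input, not something stated in the question. Under that assumption, a possible variation uses (?!\s) instead of (?! ) and adds the s (dotAll) flag, which is also part of ES2018, so that . can cross line breaks. The function name splitSentences is hypothetical.

```javascript
function splitSentences(text) {
  // (?!\s)  : a match may not start with any whitespace character
  // s flag  : "." also matches line breaks, so a sentence can span lines
  const pattern = /(?!\s).*?[.!?](?![.!?])(?<!\b\w\w\.)/gs;
  const sentences = text.match(pattern) || [];
  // Collapse line breaks, tabs, and repeated spaces inside each sentence.
  return sentences.map((sentence) => sentence.replace(/\s+/g, " "));
}

console.log(splitSentences("Sou Dr.\nJosé.  Adoro\nanimais!!\tE você?"));
// [ 'Sou Dr. José.', 'Adoro animais!!', 'E você?' ]
```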

And instead of trying to generalize all abbreviation cases, you may prefer to list each specific case in the negative lookbehind from before:

(?! ).*?[.!?](?![.!?])(?<!\bDr\.|Dra\.|Srs\.|Sras\.)

Final result:

var paragrafo = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
var frases = paragrafo.match(/(?! ).*?[.!?](?![.!?])(?<!\bDr\.|Dra\.|Srs\.|Sras\.)/g);
console.log(frases);
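
With many exceptions, it may be easier to build the lookbehind from a list than to edit the pattern by hand. The sketch below assumes a hypothetical list of abbreviations written without the trailing period; the helper names escapeRegExp and buildSentencePattern are also made up for illustration. It wraps the alternatives in a non-capturing group so that \b and \. apply to every entry, whereas in the hand-written pattern \b is part of the first alternative only.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the sentence pattern from a list of abbreviations (no trailing period).
function buildSentencePattern(abbreviations) {
  const alternatives = abbreviations.map(escapeRegExp).join("|");
  return new RegExp(
    "(?!\\s).*?[.!?](?![.!?])(?<!\\b(?:" + alternatives + ")\\.)",
    "g"
  );
}

// Hypothetical exception list; extend it per language as needed.
const pattern = buildSentencePattern(["Dr", "Dra", "Srs", "Sras"]);
const texto = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
console.log(texto.match(pattern));
```

For the sample text this should give the same five sentences as the hand-written pattern.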

I hope I’ve helped.

Update:

The solution above uses the lookbehind introduced in ES2018 (ES9). Since the OP said in a comment below that their browser does not yet support it, here is also a solution that does not use lookbehind:

(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])

Explanation:

  • (?! ) - This is a negative lookahead that keeps the space left between sentences out of the match.
  • (.*?(\b\w\w\.))* - This captures any characters up to an exception. As the general rule I used the same pattern explained earlier (\b\w\w\.), but you can also list the exceptions individually, as in the lookbehind example. The pattern is placed in a capture group followed by the quantifier *, so it can repeat zero or more times.
  • .*?[.?!] - This captures all characters, using a lazy quantifier, up to the nearest period, exclamation mark, or question mark.
  • (?![.?!]) - This is a negative lookahead. It means I do not want a match that is followed by another punctuation mark. I use it to capture phrases like Adoro animais!!.
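
One caveat with the group (.*?(\b\w\w\.))*: because .*? can also match sentence terminators, the group may run past earlier sentence ends to reach a later abbreviation. For example, "Adoro animais! Sou Dr. José." comes back as a single match. A possible variant, sketched here under the assumption of a fixed and hypothetical abbreviation list, consumes at each step either a known abbreviation or a single character that is not a terminator, so it cannot skip over a sentence end. The names KNOWN_ABBREVIATIONS and splitWithoutLookbehind are illustrative only.

```javascript
// Hypothetical abbreviation list, written without the trailing period.
const KNOWN_ABBREVIATIONS = ["Dr", "Dra", "Srs", "Sras"];

function splitWithoutLookbehind(text) {
  const abbreviations = KNOWN_ABBREVIATIONS.join("|");
  // (?!\s)                 : do not start a match on whitespace
  // \b(?:Dr|Dra|...)\.     : a whole abbreviation, including its period
  // [^.!?]                 : or any single character that is not a terminator
  // [.!?]+                 : then one or more terminators end the sentence
  const pattern = new RegExp(
    "(?!\\s)(?:\\b(?:" + abbreviations + ")\\.|[^.!?])+[.!?]+",
    "g"
  );
  return text.match(pattern) || [];
}

console.log(splitWithoutLookbehind("Adoro animais! Sou Dr. José. E você?"));
// [ 'Adoro animais!', 'Sou Dr. José.', 'E você?' ]
```

This variant uses no lookbehind, so it also runs in engines that lack that feature.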

Demonstration:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
console.log(frases);
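
Since lookbehind support differs between engines, one option is to detect it at runtime and choose between the two patterns from this answer. This is a sketch, and the function names supportsLookbehind and sentencePattern are hypothetical. The patterns are built with the RegExp constructor from strings because a regex literal containing a lookbehind is a syntax error when the script is parsed by an engine that does not support it, before any try/catch can run.

```javascript
// Returns true when the engine accepts lookbehind syntax.
function supportsLookbehind() {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch (error) {
    return false;
  }
}

// Picks the lookbehind pattern when available, otherwise the fallback.
function sentencePattern() {
  return supportsLookbehind()
    ? new RegExp("(?! ).*?[.!?](?![.!?])(?<!\\b\\w\\w\\.)", "g")
    : new RegExp("(?! )(.*?(\\b\\w\\w\\.))*.*?[.?!](?![.?!])", "g");
}

const entrada = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
console.log(entrada.match(sentencePattern()));
```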

  • But JavaScript has no negative lookbehind ;/

  • @Peace Lookbehind in JavaScript regular expressions has reached stage 4 of the TC39 process, which means the feature has been accepted into ECMAScript. It is already available in Chrome from version 62.

  • I had no idea... +1

  • Sounds like a great solution! But when I run it in the browser I get SyntaxError: invalid regexp group. I'm using Firefox.

  • @Juliano Unfortunately Firefox does not yet support lookbehind in JavaScript, since this is a recent feature. This regular expression will only work in Google Chrome from version 62 onward.

  • Is there a way to exclude numbers as well? Example: 190.923. I tried (?! ).*?[.!?](?![.!?])(?<!\bDr\.|Dra\.|Srs\.|Sras\.|[0-9]\.[0-9]\.) but it didn't work.

  • It works if you use just \d\. instead of [0-9]\.[0-9]\..
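
A note on the \d\. suggestion: with it in the lookbehind, a sentence that ends in a number, such as "O total foi 190.923.", also loses its final split point, because that last period is preceded by a digit too. An alternative, sketched here as an assumption rather than something the commenters tested, is to look ahead instead: a period does not end a sentence when a digit follows it. The function name splitKeepingNumbers and the sample text are hypothetical.

```javascript
function splitKeepingNumbers(text) {
  // (?![.!?\d]) : the terminator may not be followed by more punctuation
  //               or by a digit, so "190.923" is not split in the middle
  const pattern = /(?!\s).*?[.!?](?![.!?\d])(?<!\b(?:Dr|Dra|Srs|Sras)\.)/g;
  return text.match(pattern) || [];
}

console.log(splitKeepingNumbers("O total foi 190.923. Sou Dr. José."));
// [ 'O total foi 190.923.', 'Sou Dr. José.' ]
```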
