Solution (ECMAScript 2018 / ES9):
.*?[.!?](?![.!?])(?<!\b\w\w\.)
Demonstration:
var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(frases);
Explanation:
First of all, a warning: this regex has the limitation of only letting through abbreviations of exactly two letters (e.g. "Dr.", "Mr.", "Fr.").
.*?[.!?]
- Here we capture any stretch of text that ends with a period, exclamation mark, or question mark. The lazy quantifier makes each sentence match separately.
(?![.!?])
- This is a negative lookahead. It rejects a match that is immediately followed by one of these punctuation marks. I use it so that repeated punctuation is captured too, as in the excerpt Adoro animais!!
(?<!\b\w\w\.)
- This is a negative lookbehind. It rejects a match that ends with a \b (a word boundary) followed by two \w characters (the same as [a-zA-Z0-9_]) and a period. As a result, text like Dr. José stays within the same sentence, but something like Dra. Maria is still split.
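To make the role of each piece more concrete, here is a small sketch that runs three variants of the pattern over a shortened version of the sample paragraph. The variable names (soLazy, comLookahead, completo, feminino) are made up for this illustration, and the lookbehind variants assume an engine with ES2018 support.

```javascript
var paragrafo = "Sou Dr. José. Adoro animais!! E você?";

// Only the lazy part: every ., ! or ? ends a match, so "!!" is cut in two.
var soLazy = paragrafo.match(/.*?[.!?]/g);
console.log(soLazy);
// → ["Sou Dr.", " José.", " Adoro animais!", "!", " E você?"]

// With the negative lookahead, the repeated "!!" stays in one match.
var comLookahead = paragrafo.match(/.*?[.!?](?![.!?])/g);
console.log(comLookahead);
// → ["Sou Dr.", " José.", " Adoro animais!!", " E você?"]

// With the negative lookbehind as well, "Dr." no longer ends a sentence.
var completo = paragrafo.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(completo);
// → ["Sou Dr. José.", " Adoro animais!!", " E você?"]

// The two-letter limitation: a three-letter abbreviation is still split.
var feminino = "Sou Dra. Maria.".match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(feminino);
// → ["Sou Dra.", " Maria."]
```

Note that the matches after the first one still begin with a space.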
That is the idea behind this regular expression. It can be improved, though. For example, to remove the spaces left at the start of each piece, we can add a negative lookahead at the beginning that skips spaces:
(?! ).*?[.!?](?![.!?])(?<!\b\w\w\.)
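A quick sketch of the difference, assuming the same shortened sample text as input. The names semEspacos and aparado are only illustrative. Trimming each match afterwards is shown as a roughly equivalent alternative for this input, not as a replacement for the lookahead.

```javascript
var paragrafo = "Sou Dr. José. Adoro animais!! E você?";

// The leading (?! ) refuses to start a match on a space,
// so each sentence starts at its first letter.
var semEspacos = paragrafo.match(/(?! ).*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(semEspacos);
// → ["Sou Dr. José.", "Adoro animais!!", "E você?"]

// Alternative: keep the simpler pattern and trim every match afterwards.
var aparado = paragrafo
  .match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g)
  .map(function (frase) { return frase.trim(); });
console.log(aparado);
// → ["Sou Dr. José.", "Adoro animais!!", "E você?"]
```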
And instead of trying to generalize all abbreviations, you might prefer to list each specific case in the negative lookbehind from before:
(?! ).*?[.!?](?![.!?])(?<!\bDr\.|Dra\.|Srs\.|Sras\.)
Final result:
var paragrafo = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
var frases = paragrafo.match(/(?! ).*?[.!?](?![.!?])(?<!\bDr\.|Dra\.|Srs\.|Sras\.)/g);
console.log(frases);
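If the list of exceptions grows, it may be easier to build the pattern from an array than to edit it by hand. The sketch below is one way to do that. The helper name criarRegexDeFrases and the abbreviation list are assumptions of mine, not part of any library. It also differs slightly from the pattern above: the alternatives are wrapped in (?:...) so that \b applies to every abbreviation, not only the first one.

```javascript
// Hypothetical helper: builds the sentence regex from a list of
// abbreviations, each written WITHOUT its trailing period.
function criarRegexDeFrases(abreviacoes) {
  var alternativas = abreviacoes
    // Escape regex metacharacters so each entry is matched literally.
    .map(function (a) { return a.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); })
    .join("|");
  // Same structure as before; only the lookbehind is generated.
  return new RegExp(
    "(?! ).*?[.!?](?![.!?])(?<!\\b(?:" + alternativas + ")\\.)",
    "g"
  );
}

var paragrafo = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
var frases = paragrafo.match(criarRegexDeFrases(["Dr", "Dra", "Srs", "Sras"]));
console.log(frases);
// → ["Sras. e Srs., eu sou Dr. José.", "Minha esposa é a Dra. Maria.",
//    "Meu passatempo é assistir séries.", "Adoro animais!!", "E vocês?"]
```

The list could then be loaded per language, which seems relevant if the text comes from documents in several languages.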
I hope I’ve helped.
Update:
The solution presented above uses the lookbehind introduced in ES9. Since the OP said in a comment below that they are using a browser that does not yet support it, here is also a solution that does not use lookbehind:
(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])
Explanation:
(?! )
- This is a negative lookahead, used so that the spaces left between sentences are not captured.
(.*?(\b\w\w\.))*
- Here I capture any characters up to one of the exceptions. As a general rule I used the same pattern explained before (\b\w\w\.), but you also have the option of listing the exceptions individually, as in the example with the lookbehind. This pattern is placed in a capture group followed by the quantifier *, meaning it can repeat zero or more times.
.*?[.?!]
- Here all characters are captured with a lazy quantifier, up to the nearest period, exclamation mark, or question mark.
(?![.?!])
- This is a negative lookahead. It means I do not want a match that is followed by another punctuation mark. I use it to capture phrases like Adoro animais!!
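As a rough check of what the group adds, the sketch below compares the pattern with and without it. The variable names and the last input string ("Tudo bem? ...") are made up for this illustration. The last case shows a behavior that may be worth testing against your own data: the .*? inside the group is free to run past sentence-ending punctuation when an abbreviation appears later in the text.

```javascript
var paragrafo = "Sou Dr. José. Adoro animais!!";

// Without the group, the period after "Dr" ends a sentence.
var semGrupo = paragrafo.match(/(?! ).*?[.?!](?![.?!])/g);
console.log(semGrupo);
// → ["Sou Dr.", "José.", "Adoro animais!!"]

// With the group, "Sou Dr." is consumed first and the match
// continues up to the period after "José".
var comGrupo = paragrafo.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
console.log(comGrupo);
// → ["Sou Dr. José.", "Adoro animais!!"]

// Caveat: the group can swallow an earlier sentence boundary
// when the abbreviation only shows up in a later sentence.
var unidas = "Tudo bem? Sou Dr. José."
  .match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
console.log(unidas);
// → ["Tudo bem? Sou Dr. José."]
```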
Demonstration:
var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
console.log(frases);
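If the same script has to run in browsers both with and without lookbehind, one possible approach is to detect support at runtime and pick the pattern accordingly. This is only a sketch: suportaLookbehind and regexFrases are hypothetical names. The lookbehind version is built from a string with new RegExp, because a regex literal containing (?<!...) would be a syntax error at parse time in an engine that does not support it.

```javascript
// Hypothetical helper: returns true when the engine accepts lookbehind syntax.
function suportaLookbehind() {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch (erro) {
    // Engines without ES2018 lookbehind throw a SyntaxError here.
    return false;
  }
}

var regexFrases = suportaLookbehind()
  ? new RegExp("(?! ).*?[.!?](?![.!?])(?<!\\b\\w\\w\\.)", "g")
  : /(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g;

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(regexFrases);
console.log(frases);
// → ["Sou Dr. José.", "Meu passatempo é assistir séries.", "Adoro animais!!", "E você?"]
```

For this sample paragraph both patterns give the same result, so the fallback is transparent to the rest of the code.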
Well, you need a defined criterion before any code can do this. For your example, split would be perfect, but the code cannot "know" that the period after "Dr" should not trigger the split. So it gets complicated, since the period alone does not solve your case.
– Ricardo Pontual
What do you want to do this for? You may be using the wrong solution for the problem.
– Julio Henrique
I understand. I did some research and saw that I can do this with regex. Would you know a way for me to add exceptions, such as Dr.?
– Juliano
In general terms, I need to take paragraphs from a PDF (this part I already have working), split them into sentences, and then search a database for a record of each sentence.
– Juliano
This may be just an example, and in the real case you will have more exceptions. You can normalize the exceptions before performing the split. For instance, before you do the split, replace Dr. with Doctor. You will have problems with ... in sentences too, which is another exception, and so on. So one step before the split is to create a normalization with replace.
– BrTkCa
Yes, I think I can use this. The problem is that I will have MANY exceptions, because the PDFs will come in several languages... But I think this is the only way out.
– Juliano