Split on periods, with exceptions

I have the following regex:

(?! ).*?[|.!?:;\b\t\n](?![|.!?:;\b\t\n]|(\.\d))(?<!\bDr\.|\bINC\.|\binc\.|\bInc\.|\bNO\.|\bNo\.|\bno\.|\bN\.|\bn\.|\bReg\|\breg\.|\bREG\.|\bCo\.|\bDra\.|\bSrs\.|\bSr\.|\bSra\.|\bSra\.|\bFl\.|S\.|A\.|\bSras\.|\&|\&amp;\d\.)

In the sentence: "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br. Muito obrigado."

It returns an array like this:

Tenho 190.
000 pontos, meu e-mail é lorem.
ispum@dolor.
com.
br.
Muito obrigado.

What I actually need is to skip any period that sits between two letters or digits.

The result I expect is:

Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br.
Muito obrigado.
  • I didn't quite follow. It seems to me that you just want to separate "Muito obrigado" from the rest of the sentence. Otherwise, the question isn't well explained. A regex works from a pattern, so what is the pattern? That's the part I don't understand.

  • The sentence was just an example. I need to split a string on ".", but skip the period when it sits between two digits/letters.

  • I know, but before "Muito obrigado" there's a space, which is neither a letter nor a digit.

  • So it should be split there.

  • Isn't it easier to state the rule as "any period followed by a space"? A simple regex like \.\s solves the problem.

  • It still needs to follow the rule in my regex, also breaking on | . ! ? : ; \b \t \n, with the appropriate exceptions (Dr., Dra., INC., etc.).

  • @Maxfratane \s also matches the newline character, so I think matching only a space would be better: \.( ), then replacing the space in the group with a newline (\n).

1 answer

You could use something like split(/\.\s/) (split on a period followed by whitespace), but the problem is that split removes the period from the first string (and it seems you want to keep it).
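To make the problem concrete, here is a small sketch using a shortened version of the sentence from the question (the variable names are only illustrative):

```javascript
// Sample sentence based on the question, shortened for illustration.
const sentence = "Tenho 190.000 pontos. Muito obrigado.";

// The separator (period + whitespace) is consumed by split,
// so the first part loses its final period.
const naiveParts = sentence.split(/\.\s/);

console.log(naiveParts);
// [ "Tenho 190.000 pontos", "Muito obrigado." ]
```

The last part keeps its period only because nothing follows it, so it is never matched as a separator.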

So the way to go is a lookbehind:

let frase = "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br. Muito obrigado.";
console.log(frase.split(/(?<=\.)\s/));
// [ "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br.", "Muito obrigado." ]

A lookbehind checks whether something exists before a given position in the string, but the chunk it checks is not part of the match, so split does not remove it.

In the case above, the lookbehind is (?<=\.): it checks whether there is a period (\.) right before the current position. What gets matched at the current position is \s (which corresponds to several characters, such as space, TAB or line breaks; see the full list in the documentation).

This ensures that the break happens at whitespace, but only whitespace that has a period before it. Since the period is inside a lookbehind, it is not part of the match and split does not remove it. The whitespace is removed, so the result is the array:

[ "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br.", "Muito obrigado." ]
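A quick way to confirm that the period stays outside the match is to inspect what exec returns. This is only an illustrative sketch with a made-up sample string:

```javascript
// Made-up sample: the period is at index 3, the space at index 4.
const sample = "Fim. Outra frase.";
const match = /(?<=\.)\s/.exec(sample);

console.log(JSON.stringify(match[0])); // " "  (only the space is matched)
console.log(match.index);              // 4    (the match starts after the period)
```

Because the match covers only the space, that is the only character split discards.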

You could also use the regex /(?<=\.)[^a-z\d]/i. Here, instead of \s, I'm using [^a-z\d] (anything that is not a letter from a to z or a digit (\d)), and the i flag makes the regex case insensitive (so it doesn't matter whether the letters are uppercase or lowercase). Depending on your use cases, it may make more sense to use this option (which matches anything other than a letter or digit, including punctuation marks, hyphens, etc.) or \s (which only matches spaces, TABs and line breaks). Choose the one that best suits your use cases.
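The difference between the two options is easier to see side by side. The sketch below uses made-up strings. It also shows a caveat worth keeping in mind: [^a-z\d] treats accented letters such as "Ó" as non-letters, so one that comes right after a period is consumed as a separator.

```javascript
// Made-up sample with a ";" right after a period.
const text = "Um.;Dois. Três";

// \s splits only at whitespace that follows a period.
const byWhitespace = text.split(/(?<=\.)\s/);
console.log(byWhitespace); // [ "Um.;Dois.", "Três" ]

// [^a-z\d] also splits at the ";" (and removes it, like any separator).
const byNonAlnum = text.split(/(?<=\.)[^a-z\d]/i);
console.log(byNonAlnum); // [ "Um.", "Dois.", "Três" ]

// Caveat: "Ó" is outside a-z, so it is treated as a separator and dropped.
const accented = "Sim.Ótimo".split(/(?<=\.)[^a-z\d]/i);
console.log(accented); // [ "Sim.", "timo" ]
```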


But that's not all

Although you don't give an example, you mention that there are exceptions, such as "Dr.", "INC." and others, that should not trigger a split.

In this case, the regex gets a little more complicated. For example, to treat "Dr.", "Dra." and "INC." as exceptions:

let frase = "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br. Muito obrigado, Dr. Fulano.";
let partes = frase.split(/(?<=(?<!dra?|inc)\.)\s/i);

console.log(partes);
// [ "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br.", "Muito obrigado, Dr. Fulano." ]

Now I also use a negative lookbehind (the (?<! ... ) part), which checks that something does not exist before a given position. In this case, the expression is dra?|inc: the string "dr" followed by an optional "a" (a? means that the a is optional), or (the "or" is indicated by |) the string "inc". The i flag ensures that it can be "DR.", "Dr.", "dr.", etc.

That means the split happens at whitespace (\s), as long as there is a period before it, and that period must not be preceded by one of the listed strings ("Dr", "Dra", "Inc").
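As a quick check of the case-insensitive exceptions, here is a sketch that runs the same regex over a few made-up sentences (the sample strings are assumptions, not from the question):

```javascript
const splitter = /(?<=(?<!dra?|inc)\.)\s/i;

// Made-up samples, each containing one abbreviation in a different case.
const samples = [
  "Fale com o DR. Silva. Ok.",
  "A dra. Ana chegou. Fim.",
  "Acme Inc. fechou. Pronto.",
];

const results = samples.map(s => s.split(splitter));
console.log(results);
// [
//   [ "Fale com o DR. Silva.", "Ok." ],
//   [ "A dra. Ana chegou.", "Fim." ],
//   [ "Acme Inc. fechou.", "Pronto." ]
// ]
```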

To cover all your conditions, it would look like this:

frase.split(/(?<=(?<!dra?|sra?s?|inc|reg|co|bn)\.)\s/i)

Now I have several options (all separated by |):

  • dra?: the letters "dr" followed by an optional "a", which covers both "dr" and "dra"
  • sra?s?: the letters "sr" followed by an optional "a", followed by an optional "s", so it covers "sr", "srs", "sra" and "sras"
  • other options ("inc", "reg", etc)

If you want to add more expressions, just add them to the regex (always separated by |).
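If the list of exceptions grows, it may be easier to build the regex from an array. The sketch below is one possible way to do it. The helper name buildSentenceSplitter and the abbreviation list are hypothetical, and it adds a \b word boundary that is not in the regex above. Without that boundary, an entry such as "co" also matches the end of ordinary words like "banco".

```javascript
// Hypothetical helper: builds the splitting regex from a list of abbreviations.
function buildSentenceSplitter(abbreviations) {
  // Escape regex metacharacters so each entry is matched literally.
  const escaped = abbreviations.map(a => a.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  // \b is an addition (an assumption about the desired behavior):
  // the abbreviation must start at a word boundary.
  return new RegExp(`(?<=(?<!\\b(?:${escaped.join("|")}))\\.)\\s`, "i");
}

const withBoundary = buildSentenceSplitter(
  ["dr", "dra", "sr", "sra", "srs", "sras", "inc", "reg", "co"]
);
const withoutBoundary = /(?<=(?<!dra?|sra?s?|inc|reg|co|bn)\.)\s/i;

// Made-up sample where an ordinary word ("banco") ends in "co".
const sampleText = "Fui ao banco. Falei com a Dra. Ana. Fim.";

const boundaryParts = sampleText.split(withBoundary);
console.log(boundaryParts);
// [ "Fui ao banco.", "Falei com a Dra. Ana.", "Fim." ]

const plainParts = sampleText.split(withoutBoundary);
console.log(plainParts);
// [ "Fui ao banco. Falei com a Dra. Ana.", "Fim." ]
```

Whether the word boundary is wanted depends on your data, so treat it as an option rather than a requirement.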


Compatibility and alternatives

Unfortunately, lookbehind is not available in all browsers. But if you want, it is possible to simulate it (with a lot of "gymnastics").
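As a rough idea of what a lookbehind-free version could look like, here is a sketch that finds each "period + whitespace" with a plain regex and checks the preceding word by hand. This is just one possible approach, not necessarily the simulation technique referred to above. The helper name splitSentences and the abbreviation list are hypothetical.

```javascript
// Hypothetical helper: splits sentences without using lookbehind.
function splitSentences(text, abbreviations) {
  const exceptions = new Set(abbreviations.map(a => a.toLowerCase()));
  const boundary = /\.\s/g; // a period followed by whitespace
  const parts = [];
  let start = 0;
  let m;
  while ((m = boundary.exec(text)) !== null) {
    // Last run of non-whitespace before the period, e.g. "Dr" in "Dr. Fulano".
    const before = text.slice(start, m.index);
    const lastWord = (before.match(/\S+$/) || [""])[0].toLowerCase();
    if (exceptions.has(lastWord)) continue; // abbreviation: not a sentence end
    parts.push(text.slice(start, m.index + 1)); // keep the period
    start = m.index + m[0].length;              // skip the whitespace
  }
  if (start < text.length) parts.push(text.slice(start)); // remaining tail
  return parts;
}

const noLookbehindParts = splitSentences(
  "Tenho 190.000 pontos. Muito obrigado, Dr. Fulano.",
  ["dr", "dra", "inc"]
);
console.log(noLookbehindParts);
// [ "Tenho 190.000 pontos.", "Muito obrigado, Dr. Fulano." ]
```

Note that this version compares the whole preceding word, so it behaves slightly differently from the regex (for example, "(Dr." with an opening parenthesis would not count as an exception).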


Another alternative (which also uses a negative lookbehind) is to do a match instead of a split. After all, match and split are two sides of the same coin: in split you say what you don't want in the final result, while in match you say what you do want:

let frase = "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br. Muito obrigado, Dr. Fulano de Tal.";
let regex = /.+?(?<=(?<!dra?|sra?s?|inc|reg|co|bn)\.)(?:\s|$)/ig;
let matches; // declared explicitly so the loop also works in strict mode

while ((matches = regex.exec(frase)) !== null) {
    console.log(matches[0].trim());
}

The regex is similar, but now it starts with .+? (one or more characters). That is, I want to take characters until I find whitespace (\s) or the end of the string ($), provided there is a period right before it (and provided that period is not preceded by "Dr", "Dra", etc). The detail is that the + quantifier is greedy by default and tries to take as many characters as possible (which could make it run to the end of the string, for example). To cancel this behavior and stop at the first position that satisfies the condition, I put the ? right after it.

I also use the g flag, which lets me walk through the string finding all matches, and for each match I call trim(), since the match also includes the space at the end. The result is:

Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br.
Muito obrigado, Dr. Fulano de Tal.

If you want to store the results in an array:

let frase = "Tenho 190.000 pontos, meu e-mail é lorem.ispum@dolor.com.br. Muito obrigado, Dr. Fulano de Tal.";
let regex = /.+?(?<=(?<!dra?|sra?s?|inc|reg|co|bn)\.)(?:\s|$)/ig;
let resultados = [];
let matches; // declared explicitly so the loop also works in strict mode

while ((matches = regex.exec(frase)) !== null) {
    resultados.push(matches[0].trim());
}

console.log(resultados);
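If the loop feels verbose, the same array can probably be built more compactly with String.prototype.match, which returns all matches when the regex has the g flag. A minimal sketch, using a shortened sample sentence and illustrative variable names:

```javascript
// Shortened sample sentence, for illustration only.
const text = "Tenho 190.000 pontos. Muito obrigado, Dr. Fulano de Tal.";
const sentenceRegex = /.+?(?<=(?<!dra?|sra?s?|inc|reg|co|bn)\.)(?:\s|$)/ig;

// match returns null when nothing matches, hence the "|| []" fallback.
const sentences = (text.match(sentenceRegex) || []).map(s => s.trim());

console.log(sentences);
// [ "Tenho 190.000 pontos.", "Muito obrigado, Dr. Fulano de Tal." ]
```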
