Regular expression to take one or more occurrences that precede and follow a given letter


3

I have the following cases...

The first text is as follows:

let text = "olá meu numero é trezentos e vinte e quatro tudo bem?"

The text I receive can also be, for example:

let text = "olá meu número é quarenta e três tudo bem?";

I tried to write a regular expression that captures the words before and after the "e". I did it this way:

console.log(text.match(/(\w+\se\s\w+)/g))

But the output I get for the first text is 'trezentos e vinte', which ignores the second "e" group. I would like the output in this case to be 'trezentos e vinte e quatro', and for the second text 'quarenta e três', using the same expression. I'm having a lot of trouble with this regex; could you help me?

  • 1

    If the sentence changes to "olá eu e ela temos o numero é trezentos e vinte e quatro tudo bem?" it will cause problems.

  • 2

    I'm upvoting the answers, but I think the problem is much bigger than regex: it is NLP.

  • 1

    Good point, Ugusto. I just searched for NLP and, from what I understand, there are some libraries in Node that handle several of these cases, right? I'll dig a little deeper into the subject, thank you very much!!

3 answers

5


As pointed out in the comments, the solution below will produce false positives if the sentence contains words that are not numbers with an "e" between them. So check whether this is really what you need...
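To make the false-positive caveat concrete, here is a small sketch using the sentence from the comments. The `NUMBER_WORDS` set is a hypothetical, deliberately incomplete whitelist introduced only for illustration; it is not part of the original answer, and a real solution would need a full vocabulary or an NLP library.

```javascript
// Sentence from the comments: "eu e ela" is not a number.
const sentence = "olá eu e ela temos o numero é trezentos e vinte e quatro tudo bem?";

// Generic pattern: a word followed by one or more " e word" repetitions.
const generic = /\p{L}+(?: e \p{L}+)+/gu;
const genericMatches = sentence.match(generic) ?? [];
console.log(genericMatches); // [ 'eu e ela', 'trezentos e vinte e quatro' ]

// Hypothetical whitelist (assumption, far from complete).
const NUMBER_WORDS = new Set(["trezentos", "quarenta", "vinte", "quatro", "três"]);

// Keep only the matches in which every word joined by " e " is a known number word.
const numbersOnly = genericMatches.filter((m) =>
  m.split(" e ").every((word) => NUMBER_WORDS.has(word.toLowerCase()))
);
console.log(numbersOnly); // [ 'trezentos e vinte e quatro' ]
```

Note that this filter discards a whole match if any of its words is unknown, so a sentence such as "ela e trezentos e vinte" would yield nothing. That is one reason the problem may be better treated as NLP than as pure regex.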


First, the shorthand \w does not match accented characters, so your regex does not even work for "quarenta e três":

let text = "olá meu número é quarenta e três tudo bem?";
console.log(text.match(/(\w+\se\s\w+)/g)); // [ 'quarenta e tr' ]

To fix this you can use the u flag (see here for more details), or put the accented letters in the regex. And to repeat the " e ..." part, just put a quantifier around it:

let text = "olá meu número é trezentos e quarenta e três tudo bem?";

// with the "u" flag and Unicode property escapes
console.log(text.match(/\p{L}+( e \p{L}+)+/ug)); // [ 'trezentos e quarenta e três' ]

// putting the accented letters in the regex
console.log(text.match(/[a-záéíóúâêôãõ]+( e [a-záéíóúâêôãõ]+)+/ig)); // [ 'trezentos e quarenta e três' ]

With the u flag, I used Unicode property escapes (here, \p{L} matches any letter defined by Unicode, that is, those in the categories starting with "L" in this list). One caveat is that this is quite broad, since it also matches letters from other alphabets (Japanese, Arabic, etc.). If you want to restrict it to the Latin alphabet, you can replace \p{L} with \p{Script=Latin}.
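As a quick illustration of the difference in scope, the sketch below tests whole words against both properties. The sample words are my own, and the accented letter is written as an escape so that it is guaranteed to be the single precomposed code point.

```javascript
// \p{L}: any Unicode letter. \p{Script=Latin}: only characters of the Latin script.
const anyLetter = /^\p{L}+$/u;
const latinOnly = /^\p{Script=Latin}+$/u;

const latinWord = "tr\u00EAs"; // "três", with a precomposed ê
const japaneseWord = "日本語";

console.log(anyLetter.test(latinWord), latinOnly.test(latinWord));       // true true
console.log(anyLetter.test(japaneseWord), latinOnly.test(japaneseWord)); // true false
```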

In the second case, I put the accented letters in the regex. I only listed lowercase letters, but I used the i flag, so uppercase letters are matched as well. This is an alternative if your environment does not yet support Unicode property escapes.

As for the " e ..." part, I wrapped it in a group with its own quantifier (in this case +, because I understood that this part repeats one or more times). I also replaced \s with a plain space, since \s also matches line breaks and other whitespace characters.
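The practical effect of swapping \s for a plain space can be seen with an input that has a line break before the "e". The input string here is made up for the demonstration.

```javascript
// A line break sits between "trezentos" and "e".
const multiline = "trezentos\ne vinte";

const withWhitespaceClass = /\p{L}+(?:\se\s\p{L}+)+/u; // \s accepts the line break
const withLiteralSpace = /\p{L}+(?: e \p{L}+)+/u;      // only a plain space is accepted

console.log(withWhitespaceClass.test(multiline)); // true
console.log(withLiteralSpace.test(multiline));    // false
```

Which behavior is preferable depends on the input: if the text may wrap across lines, \s (or an explicit [ \t]) may be the better choice.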

There is one more difference: \w also matches the digits 0 to 9 and the character _, whereas the regexes above match only letters (which seems to make more sense in your case).
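A short check, character by character, of what \w and \p{L} each accept. The probe characters are chosen only for illustration.

```javascript
// "\u00EA" is the precomposed letter ê.
const probes = ["a", "\u00EA", "7", "_"];

const table = probes.map((ch) => ({
  ch,
  w: /^\w$/.test(ch),           // ASCII letters, digits and underscore
  letter: /^\p{L}$/u.test(ch),  // any Unicode letter, nothing else
}));
console.log(table);
// a: w=true,  letter=true
// ê: w=false, letter=true
// 7: w=true,  letter=false
// _: w=true,  letter=false
```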


On regex and accents, see also here and here.

  • Thank you so much for your help, I don’t have much knowledge in regex so I was lost, but you saved me, thank you so much!

3

This happens because the regular expression does not account for sequences with e that come one right after the other.

So, although this works:

/(\w+\se\s\w+)/g

foo aaa e bbb bar baz ccc e ddd qux

This wouldn’t work:

/(\w+\se\s\w+)/g

foo aaa e bbb e ccc qux

This is because matches found by /(\w+\se\s\w+)/g cannot overlap. The expression requires a word both before and after the e. Once a word has been consumed as the end of one match, it is no longer available as the "before" word of the next e, so that second match is impossible.
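The consumption of "bbb" can be observed directly through lastIndex, as in this small sketch. The variable names are mine.

```javascript
const input = "foo aaa e bbb e ccc qux";
const pairRegex = /\w+\se\s\w+/g;

// First search: matches "aaa e bbb" and leaves the cursor right after "bbb".
const first = pairRegex.exec(input);
const resumeAt = pairRegex.lastIndex;
console.log(first[0], resumeAt);    // 'aaa e bbb' 13
console.log(input.slice(resumeAt)); // ' e ccc qux' (no word left before this "e")

// Second search starts at index 13 and finds nothing.
const second = pairRegex.exec(input);
console.log(second); // null
```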

One solution is to indicate that any term after the e can be repeated within a single match. An option would be like this:

/\w+(?:\se\s\w+)+/g

foo aaa e bbb bar baz ccc e ddd e eee qux

See on Regex101.

Although the above regular expression works for cases where words are formed by ASCII alphanumeric characters, accented letters (such as é, á, à etc) are not covered by \w.

So you can change the expression to:

/\p{L}+(?:\se\s\p{L}+)+/gu

foo aáà e bbb bar baz ccc e ddd e eéè qux

See on Regex101.

With the u flag you can use \p{L}, which matches any letter defined by the Unicode standard, including the accented characters mentioned above.

Although it is already well supported, some environments may not implement regular expressions with the unicode flag. In such cases, see the other answer for alternatives to \p{L} with the u flag.
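One possible way to handle both kinds of environment is to attempt the Unicode version at runtime and fall back to the explicit character class from the other answer. This is only a sketch: the function name is hypothetical, and the fallback class covers just the accented letters listed in that answer.

```javascript
// Build the regex with the RegExp constructor so that an unsupported
// flag or \p{...} escape throws at runtime instead of at parse time.
function buildWordsJoinedByE() {
  try {
    return new RegExp("\\p{L}+(?: e \\p{L}+)+", "gu");
  } catch (err) {
    // Fallback: explicit accented letters plus the "i" flag for uppercase.
    return /[a-záéíóúâêôãõ]+(?: e [a-záéíóúâêôãõ]+)+/gi;
  }
}

const joined = buildWordsJoinedByE();
const found = "numero trezentos e vinte e quatro ok".match(joined);
console.log(found); // [ 'trezentos e vinte e quatro' ]
```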


Unrelated to the answer, but it is worth noting that the original regular expression from the question (/(\w+\se\s\w+)/g) could be replaced by /\w+\se\s\w+/g, since the capture group does nothing in this case.
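A brief sketch of why the group is redundant here: with the g flag, String.prototype.match returns only the full matches, so capture groups are not reported. Groups do become visible when the g flag is omitted. The sample string is made up.

```javascript
const sample = "foo aaa e bbb bar";

// With the g flag, both forms return the same list of full matches.
const withGroup = sample.match(/(\w+\se\s\w+)/g);
const withoutGroup = sample.match(/\w+\se\s\w+/g);
console.log(withGroup, withoutGroup); // [ 'aaa e bbb' ] [ 'aaa e bbb' ]

// Without the g flag, the groups are included after the full match.
const single = sample.match(/(\w+)\se\s(\w+)/);
console.log(single[0], single[1], single[2]); // 'aaa e bbb' 'aaa' 'bbb'
```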

2

In this case what you need is to put the repeating part (the e and the word that follows it) in a group and apply a quantifier to it. See:

let text1 = "olá meu numero é trezentos e vinte e quatro tudo bem?";
let text2 = "olá meu número é quarenta e três tudo bem?";

console.log(text1.match(/\w+(\se\s[a-zê]+)+/g))
console.log(text2.match(/\w+(\se\s[a-zê]+)+/g))
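As a cautionary sketch, the expression above still relies on \w for the first word and on [a-zê] for the following ones, so it can truncate a match when an accented word comes first. The input below is my own example, compared against the \p{L} variant from the other answers.

```javascript
// "tr\u00EAs" is "três" with a precomposed ê.
const tricky = "tr\u00EAs e dois";

// \w cannot cross the ê, so the match starts at the trailing "s".
const asciiResult = tricky.match(/\w+(\se\s[a-zê]+)+/g);
console.log(asciiResult); // [ 's e dois' ]

// With \p{L} and the u flag, the whole first word is included.
const unicodeResult = tricky.match(/\p{L}+(?:\se\s\p{L}+)+/gu);
console.log(unicodeResult); // [ 'três e dois' ]
```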

  • Ah, OK. I had put [a-zê] just to handle this case. But your solution is much more general. Thanks for the tips.

  • Okay. I edited the answer
