What is the regex for all accented, whole words coming from a variable?

Question


Asked 3 years, 10 months ago


Starting from a variable such as var palavra = 'nascer', which regular expression matches every occurrence of that word as a whole word in a text? This must include words that begin or end with an accented character.

Considerations:

A word is not a substring: "nasceram".includes("nascer") returns true, but that does not work for me because I want whole words, not substrings (if anything is unclear, let me know in the comments).
I am deliberately not considering whitespace and punctuation (before and after), because the variable palavra contains only the word itself: nothing before it, nothing after it.
I used word boundaries (\b), which work reasonably well, but they do not handle accented characters.

Here is what I did:

var palavra = 'nascer'

// regex that matches every occurrence of the word 'nascer'
const regex = new RegExp(`\\b${palavra}\\b`, 'g');

This uses word boundaries (\b) and the global flag (g), and it works: it matches every occurrence of the whole word and skips derived forms such as 'nasceram'.

However, \b does not handle accented characters, so the regex fails when the variable starts or ends with one:

var palavra = 'água'
// or
var palavra = 'café'

In the cases above, the regex does not work.

How can I improve this regex so that it selects whole words (taken from a variable), including words with an accent at the beginning or end?

I tried something with ^ and $, but it did not work:

/^[A-Za-záàâãéèêíïóôõöúçñÁÀÂÃÉÈÍÏÓÔÕÖÚÇÑ ]+$/
// or this
/^[a-záàâãéèêíïóôõöúçñ ]+$/i

Also, I do not know how to use a variable as the matching criterion in it.

1 answer


by hkotsubo • **55,826** points · Answer 1 · 2021-09-01T20:02:06+00:00

In JavaScript, the shortcut \b is defined in terms of ASCII word characters only ([A-Za-z0-9_]), so accented characters are not treated as part of a word.
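To make the problem concrete, here is a small sketch (the sample sentences are made up for illustration). Because á and é do not count as word characters for \b, the boundary is looked for in the wrong place: the whole word is missed, and a derived form such as 'cafés' can even match by accident.

```javascript
// \b sits between a word character ([A-Za-z0-9_]) and a non-word character.
// Accented letters count as non-word characters here.
const startsWithAccent = /\bágua\b/.test('tem água aqui');
const endsWithAccent = /\bcafé\b/.test('toma café agora');
// 'é' (non-word) followed by 's' (word) forms a boundary
const falsePositive = /\bcafé\b/.test('dois cafés');

console.log(startsWithAccent); // false
console.log(endsWithAccent);   // false
console.log(falsePositive);    // true
```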

An alternative is to simulate its behavior using lookarounds together with Unicode property escapes:

const regex = new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ug');

First there is a negative lookbehind, the part inside (?<! ), which asserts that the word is not preceded by a letter (\p{L} is any letter as defined by Unicode, which covers many different scripts, such as Japanese and Arabic, as well as accented characters).

At the end there is a negative lookahead, the part inside (?! ), which asserts that the word is not followed by a letter.

In other words, it works like a "Unicode version" of \b (a much simplified one, because \b is actually a little more complicated than that).
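One of the differences: \b also treats digits and underscore as part of a word, while the lookarounds above only look at letters. If that matters for your text, a possible variation (an assumption about what you may want, not something the question asked for) is to add \p{N} and _ to the lookarounds. The function names below are made up for the example.

```javascript
// Lookarounds that only reject letters
function onlyLetters(palavra) {
    return new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ug');
}

// Variation: also reject digits (\p{N}) and underscore
function lettersDigitsUnderscore(palavra) {
    return new RegExp(`(?<![\\p{L}\\p{N}_])${palavra}(?![\\p{L}\\p{N}_])`, 'ug');
}

const frase = 'café2 café_ café';
console.log(frase.replace(onlyLetters('café'), 'X'));             // X2 X_ X
console.log(frase.replace(lettersDigitsUnderscore('café'), 'X')); // café2 café_ X
```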

Note that the u flag is required for \p to work.
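A quick way to see the difference (just a sketch): without the flag, \p is treated as an escaped letter p, so the pattern looks for the literal text p{L} instead of "any letter".

```javascript
const withU = new RegExp('\\p{L}', 'u');
const withoutU = new RegExp('\\p{L}');

console.log(withU.test('é'));       // true
console.log(withoutU.test('é'));    // false
console.log(withoutU.test('p{L}')); // true: it matches the literal characters
```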

Testing:

function testar(palavra, frase) {
    const regex = new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ug');
    // replace the word with X, to see whether the right occurrences were matched
    console.log(frase.replace(regex, 'X'));
}

testar('água', 'tem água nas águas, deságua'); // tem X nas águas, deságua
testar('café', 'toma café mas não nescafé, cafés, café'); // toma X mas não nescafé, cafés, X

Finally, a tip: take some precautions when building a regex from the value of an arbitrary variable. See: "Creating regular expressions with a dynamic pattern is problematic? If yes, there is a way to avoid the problem?"
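A minimal sketch of one such precaution: escape the characters that have a special meaning in a regex before interpolating the variable. The helper names escapeRegExp and wholeWord are made up for this example (the escaping expression is a widely used idiom, not a built-in function).

```javascript
// Prefix every regex syntax character with a backslash
function escapeRegExp(texto) {
    return texto.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Same lookarounds as above, but with the variable escaped first
function wholeWord(palavra) {
    return new RegExp(`(?<!\\p{L})${escapeRegExp(palavra)}(?!\\p{L})`, 'ug');
}

console.log(escapeRegExp('c++'));                          // c\+\+
console.log('uso c++ e c'.replace(wholeWord('c++'), 'X')); // uso X e c
```

Without the escaping, a value such as 'c++' makes new RegExp throw a SyntaxError, and a value such as 'a.b' would also match 'axb'.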

And for the record, about this regex that you tried:

/^[a-záàâãéèêíïóôõöúçñ ]+$/i

It doesn’t work because it means the following:

the anchors ^ and $ mark the start and the end of the string
then there’s the character class [a-záàâãéèêíïóôõöúçñ ], which matches letters (with and without accents) and the space character
the quantifier + means "one or more"
and the i flag makes the match case-insensitive

Therefore, this regex matches one or more letters (with or without accents) and spaces, from the beginning to the end of the string. That is, it only matches if the entire string consists of these characters. A single different character (a punctuation mark, for example) is enough for it to find no match at all. In other words, it does something very different from what you need.
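A short check illustrates this (the sample strings and the variable name are made up):

```javascript
const attempt = /^[a-záàâãéèêíïóôõöúçñ ]+$/i;

// Only letters and spaces: the whole string matches
console.log(attempt.test('Tem água'));      // true
// The comma is not in the character class, so nothing matches
console.log(attempt.test('tem água, sim')); // false
```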