What regex matches whole words, including accented ones, taken from a variable?

Starting from a variable var palavra = 'nascer' (for example), which regular expression matches every whole-word occurrence of it in a text? This includes words that have an accented character at the beginning or end.

Considerations:

  • A word is not a substring: "nasceram".includes("nascer") returns true, which does not work for me because I want whole words, not substrings (is that clear? If anything is unclear, let me know in the comments).
  • I am not concerned with whitespace and punctuation (before or after), because the variable palavra holds only the word itself: nothing before, nothing after.
  • I used word boundaries, which work reasonably well, but they do not handle accented characters.

This is what I did:

var palavra = 'nascer'

// the regex that matches every whole word 'nascer'
const regex = new RegExp(`\\b${palavra}\\b`, 'g');

With the word boundaries (\b) and the global flag (g), it works. It matches every occurrence of the whole word, without derived forms such as 'nasceram'.

However, \b does not handle accents, so nothing is matched when the variable starts or ends with an accented character:

var palavra = 'água'
// or
var palavra = 'café'

In the cases above, the regex does not work.

How can I improve this regex to select whole words (taken from a variable), including those with an accent at the beginning or end?

I tried something with ^ and $, but it did not work:

/^[A-Za-záàâãéèêíïóôõöúçñÁÀÂÃÉÈÍÏÓÔÕÖÚÇÑ ]+$/
// or this
/^[a-záàâãéèêíïóôõöúçñ ]+$/i

but I don't know how to use a variable as the matching criterion.

1 answer

In JavaScript, the shortcut \b only considers ASCII characters (it is defined in terms of \w, that is, [A-Za-z0-9_]), so accented characters are not treated as part of a word.
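
To see the effect, here is a minimal sketch (the variable names are only illustrative, and the text is assumed to use precomposed accented characters). Because 'á' is not a \w character, \b finds no boundary between a space and 'á', but it does find one between 's' and 'á' inside 'deságua':

```javascript
const fraseAgua = 'tem água nas águas, deságua';

// \b is defined in terms of \w, which in JavaScript is [A-Za-z0-9_],
// so 'á' is treated as a non-word character.
const comBoundary = new RegExp('\\bágua\\b', 'g');

// There is no boundary between ' ' and 'á' (both are non-word characters),
// so the standalone word is missed, while 's' followed by 'á' inside
// 'deságua' does count as a boundary.
const resultadoBoundary = fraseAgua.replace(comBoundary, 'X');
console.log(resultadoBoundary); // tem água nas águas, desX
```

So \b not only misses the whole word, it can also match in the middle of another word.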

An alternative is to simulate its behavior using lookarounds along with Unicode property escapes:

const regex = new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ug');

First we have a negative lookbehind: the part inside (?<! ) indicates that there is no letter before the word (where \p{L} is any letter defined by Unicode, which covers many different alphabets, such as Japanese, Arabic, etc., as well as accented characters).

At the end we have a negative lookahead: the part inside (?! ) indicates that there is no letter after the word.

In other words, it works like a "Unicode version" of \b (greatly simplified, since \b is actually a bit more complicated than that).

Note that, for \p to work, the u flag is required.

Testing:

function testar(palavra, frase) {
    const regex = new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ug');
    // replaces the word with X, so we can see whether the right occurrences were matched
    console.log(frase.replace(regex, 'X'));
}

testar('água', 'tem água nas águas, deságua'); // tem X nas águas, deságua
testar('café', 'toma café mas não nescafé, cafés, café'); // toma X mas não nescafé, cafés, X


Finally, it is worth taking some precautions when building a regex from the value of an arbitrary variable; see: "Is creating regular expressions with a dynamic pattern problematic? If so, is there a way to avoid the problem?"
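
As a rough illustration of that precaution: if palavra contains characters that are special in a regex (a dot, for instance), they should be escaped before being interpolated. The helper names escaparRegex and regexPalavraInteira below are hypothetical and the sample sentence is made up; the character list is the one commonly used for regex syntax characters.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in a
// regex, so the value of the variable is matched literally.
function escaparRegex(texto) {
    return texto.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds the whole-word regex from an arbitrary string.
function regexPalavraInteira(palavra) {
    return new RegExp(`(?<!\\p{L})${escaparRegex(palavra)}(?!\\p{L})`, 'ug');
}

const fraseSr = 'sr. Silva e sra Souza';

// Without escaping, the dot matches any character, so 'sra' is also replaced.
const semEscape = new RegExp(`(?<!\\p{L})sr.(?!\\p{L})`, 'ug');
console.log(fraseSr.replace(semEscape, 'X')); // X Silva e X Souza

// With escaping, only the literal 'sr.' is replaced.
console.log(fraseSr.replace(regexPalavraInteira('sr.'), 'X')); // X Silva e sra Souza
```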


And for the record, about this regex that you tried:

/^[a-záàâãéèêíïóôõöúçñ ]+$/i

It doesn’t work because it means the following:

  • the markers ^ and $ indicate the start and the end of the string
  • then there is the character class [a-záàâãéèêíïóôõöúçñ ], which matches letters (with or without accents) and the space
  • the quantifier + means "one or more"
  • and the flag i makes the match ignore the difference between uppercase and lowercase

Therefore, this regex matches one or more letters (with or without accents) and spaces, from the beginning to the end of the string. That means it only matches if the entire string consists of these characters. A single different character (a punctuation mark, for example) is enough for it to find no match at all. In other words, it does something very different from what you need.
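
A quick check of that behavior, as a small sketch (the sample strings and the variable name are made up for illustration):

```javascript
const regexStringInteira = /^[a-záàâãéèêíïóôõöúçñ ]+$/i;

// Every character is a letter from the class or a space: the whole string matches.
console.log(regexStringInteira.test('toma café')); // true

// A single comma is enough to make the whole match fail.
console.log(regexStringInteira.test('toma café, por favor')); // false

// The regex validates the entire string; it never isolates 'café' inside a text.
console.log('toma café, por favor'.match(regexStringInteira)); // null
```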

  • Perfect. Any ideas about handling case sensitivity? For example, with testar('água', 'Água nas águas, deságua'); should it also replace 'Água' with X? I tried the 'ugi' flags, but it did not work.

  • @Lukenegreiros The i flag should work: https://ideone.com/ZOU3TZ - I don't know whether this is one of those things that vary from browser to browser; Unicode support is quite reasonable nowadays, but I would not rule it out...

  • My implementation is wrong. Look at this: https://jsfiddle.net/apw1ruy2/ It is a program that marks repeated words of more than three letters within a sentence. In the function marcar_palavra_repetida() near the bottom, I create fraseHTML and use the word parameter to build that string, so it would never mark two equal words that differ only in case (note the second line, with 'Advinha' and 'advinha' in the same sentence; I would like those to be marked as well).

  • @Lukenegreiros The problem with "Advinha" is that indexOf is case sensitive: https://jsfiddle.net/4yk6xf83/1/
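
To illustrate the two points from the comments, here is a sketch that assumes the text uses precomposed accented characters. The variable names and the second sample sentence are made up, and lowercasing both sides is only one possible way to make indexOf ignore case (it is not necessarily what the linked fiddle does):

```javascript
const palavra = 'água';

// Adding the i flag makes the lookaround-based regex case insensitive.
// Note that flags must be lowercase.
const regexIgnoraCaixa = new RegExp(`(?<!\\p{L})${palavra}(?!\\p{L})`, 'ugi');
console.log('Água nas águas, deságua'.replace(regexIgnoraCaixa, 'X')); // X nas águas, deságua

// indexOf compares characters exactly, so the capitalized occurrence is skipped...
const fraseAdvinha = 'Advinha quem advinha';
console.log(fraseAdvinha.indexOf('advinha')); // 13

// ...unless both sides are lowercased before searching.
console.log(fraseAdvinha.toLowerCase().indexOf('advinha')); // 0
```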
