Accentuation using regex

5

I have the following function for name normalization:

function normalizaNome(nome) {
    var palavras = nome.match(/\b\w+\b/g),
        preps = ["de", "da", "do", "das", "dos"];
    return palavras.map(function(e, i) {
        return preps.indexOf(e) == -1 || i === 0 ? e[0].toUpperCase() + e.slice(1) : e;
    }).join(" ");
}

But it does not accept accented words.

2 answers

10


According to the documentation, the shortcut \w does not consider accented letters.
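A quick check makes this visible. The snippet below is only an illustration (the sample name is made up): since \w covers just a-z, A-Z, 0-9 and _, every accented letter ends the current match.

```javascript
// Hypothetical sample input, chosen only for illustration.
var nome = "joão da conceição";

// \w matches only [A-Za-z0-9_], so "ã" and "ç" split the words.
var pedacos = nome.match(/\w+/g);

console.log(pedacos); // ["jo", "o", "da", "concei", "o"]
```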

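One simple way to solve this is to include the accented characters in the regex:

var palavras = nome.match(/[\wáéíóúâêîôûãõç]+/gi);

I also added the i flag to match both upper and lower case; otherwise the character class would have to list áÁéÉ and so on. Note that there is no \b around the class: in JavaScript \b is defined in terms of \w, so it does not see a boundary after a trailing accented letter (with \b, "José" would match only "Jos"). Since + is greedy, the class alone already matches each whole word.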
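Plugged into the function from the question, this could look like the sketch below. It assumes the input is in composed (NFC) form, so that "é" is a single character, and the `|| []` fallback for strings without any match is an addition that is not in the original function.

```javascript
// Sketch of the question's function with an accent-aware character class.
// Assumption: input is NFC-normalized ("é" is one character, not "e" + combining mark).
function normalizaNome(nome) {
    // No \b here: JavaScript's \b is based on \w and would cut "josé" to "jos".
    // "|| []" avoids a TypeError when match() returns null (addition to the original).
    var palavras = nome.match(/[\wáéíóúâêîôûãõç]+/gi) || [];
    var preps = ["de", "da", "do", "das", "dos"];
    return palavras.map(function (e, i) {
        // Capitalize every word, except prepositions that are not the first word.
        return preps.indexOf(e) === -1 || i === 0
            ? e[0].toUpperCase() + e.slice(1)
            : e;
    }).join(" ");
}

console.log(normalizaNome("josé da conceição")); // "José da Conceição"
console.log(normalizaNome("ângela dos santos")); // "Ângela dos Santos"
```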

There is also the option of using /[\wà-ÿ]+/gi, since the range à-ÿ already includes several accented characters (see here), but it also accepts a few characters that are not accented letters, such as ÷ (DIVISION SIGN); the link above lists every character in the range.
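To see the side effect of the range, it can be tested on its own. This is just a small check; the variable name `intervalo` is made up for the example.

```javascript
// The range à-ÿ covers U+00E0 to U+00FF: mostly accented letters, but not only.
var intervalo = /^[à-ÿ]$/;

console.log(intervalo.test("ç")); // true
console.log(intervalo.test("÷")); // true: DIVISION SIGN (U+00F7) is inside the range
console.log(intervalo.test("a")); // false: plain ASCII letters are covered by \w instead
```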

It is worth remembering that the shortcut \w also matches the digits 0 to 9 and the _ character. If you want to match only letters, just change the regex to:

/[a-záéíóúâêîôûãõç]+/gi
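A small comparison of the two variants, using a made-up input that mixes letters, a digit and an underscore:

```javascript
// Hypothetical input for illustration only.
var texto = "maria_2 joão";

// With \w, digits and "_" are treated as part of the word.
var comW = texto.match(/[\wáéíóúâêîôûãõç]+/gi);

// With an explicit letter class, only letters are kept.
var soLetras = texto.match(/[a-záéíóúâêîôûãõç]+/gi);

console.log(comW);     // ["maria_2", "joão"]
console.log(soLetras); // ["maria", "joão"]
```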

Another alternative (not supported by older browsers) is to use Unicode property escapes:

var palavras = nome.match(/\p{L}+/gu);

Here, \p{L} matches any character in the "Letter" categories defined by Unicode (all the categories starting with "L" in this list). Note that the regex needs the u flag for this shortcut to work. \b is left out because, even with the u flag, it is still defined in terms of the ASCII-only \w.

This makes the regex a little more comprehensive, since it also matches letters from other writing systems, such as Japanese, Arabic and Cyrillic. On the other hand, it does not match digits or the _ character.
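As an illustration of both points, here is a small sketch with a made-up input that mixes scripts, digits and an underscore. It assumes an engine that supports Unicode property escapes (ES2018 or later).

```javascript
// Hypothetical input mixing Latin, Cyrillic, Japanese, digits and "_".
var nome = "josé müller Иван 太郎 r2_d2";

// \p{L} matches letters of any script; digits and "_" break the match.
var palavras = nome.match(/\p{L}+/gu);

console.log(palavras); // ["josé", "müller", "Иван", "太郎", "r", "d"]
```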

Another option, to accept only letters of the Latin alphabet and ignore others such as Japanese, Arabic, etc., is:

nome.match(/\p{Script=Latin}+/gu)
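For comparison with \p{L}, a short sketch on a made-up mixed-script input. It assumes composed (NFC) text, because combining accents on their own do not belong to the Latin script.

```javascript
// Hypothetical input for illustration only.
var nome = "josé Иван 太郎 ângela";

// Only runs of Latin-script characters are returned; Cyrillic and Japanese are skipped.
var latinas = nome.match(/\p{Script=Latin}+/gu);

console.log(latinas); // ["josé", "ângela"]
```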
  • 1

    I can understand perfectly, thank you for your help.

-4

Have you tried something like

var words = name.match(/(\w+)(\D+)/g);

?

\w matches letters, digits and _, and \D matches any character that is not a digit (which includes accented letters).
