Regex to accept accents in first letter and last letter of first and last name

I’m trying to create a regex in JavaScript that accepts names whose first and last letters may be accented:

var exp = /^((\b[A-zÀ-ú']{2,40}\b)\s*){2,}$/gm;
    var re = new RegExp(exp);

    if(nome.match(re))...

Example: Álvaro Silva, Ítalo José or Érica Santos.

  • @hkotsubo No, because I need to require a surname. Those regexes validate the accents but do not force a surname to be present. But thank you!

  • Actually, yesterday I thought the problem was just knowing how to match the accented characters. But after investigating further I saw that it was something else (I left an answer below).

1 answer

The problem with your regex is the \b shortcut. It indicates a word boundary, that is, a position that has an alphanumeric character before it and a non-alphanumeric character after it (or vice versa).

The catch is that the definition of what is alphanumeric varies widely between languages, APIs and regex engines. Some consider all letters and digits defined by Unicode (that is, they cover every alphabet in the world), while others are restricted to ASCII (only the letters "a" to "z" in upper and lower case, the digits 0 to 9, and the character _). In JavaScript’s case, guess what: accented characters are not considered "word characters".

You can do a simple test:

console.log('Álvaro'.split(/\b/)); // [ 'Á', 'lvaro' ]

The split happens at each \b (that is, at each position that has an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). Since the accented character Á is not considered alphanumeric and is immediately followed by an alphanumeric one (the letter l), the string was split at that position.


So \b will not work in your case. If the string starts with an accented character, the position before it (the beginning of the string) does not match \b: the beginning of the string counts as a position with no alphanumeric character before it, and the accented letter after it is not alphanumeric either, so there is nothing alphanumeric on either side and therefore no boundary. The same happens at the end of a name that ends with an accented letter.
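To see this concretely, here is a minimal sketch that probes the start of the string with \b and then runs the regex from the question. The helper name startsAtWordBoundary is made up for this illustration, and I dropped the g flag only so that test() keeps no state between calls:

```javascript
// Hypothetical helper (not part of the question): is there a \b at position 0?
function startsAtWordBoundary(s) {
    return /^\b/.test(s);
}

console.log(startsAtWordBoundary('Alvaro')); // true: 'A' is a word character
console.log(startsAtWordBoundary('Álvaro')); // false: 'Á' is not a word character for \b

// The regex from the question, without the g flag
const original = /^((\b[A-zÀ-ú']{2,40}\b)\s*){2,}$/m;
console.log(original.test('Alvaro Silva')); // true
console.log(original.test('Álvaro Silva')); // false: no \b before 'Á'
console.log(original.test('Ítalo José'));   // false: no \b before 'Í' or after 'é'
```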

The solution is to drop \b and instead state that the letters can only be followed by spaces or by the end of the string:

let exp = /^([a-zA-ZÀ-ú']{2,40}(\s+|$)){2,}$/;
// prints true for the first three names and false for 'SemSobrenome'
[ 'Álvaro Silva', 'Ítalo José', 'Érica Santos', 'SemSobrenome' ].forEach(s => {
    console.log(`${s}: ${exp.test(s)}`);
});

After the name I put (\s+|$): the quantifier + means "one or more" and the character | means alternation, so this part means "one or more whitespace characters, or the end of the string". You were using \s*, which means "zero or more whitespace characters" and could therefore accept input with no surname. By requiring at least one space, I guarantee that something must come after it (and the quantifier {2,} ensures that this is repeated at least twice). With this, the regex accepts names with accents and requires at least one surname.
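The difference between \s* and (\s+|$) can be checked in isolation. The sketch below uses the same character class in both regexes (without \b, so only the separator changes); the variable names are just for illustration:

```javascript
// Zero or more whitespace characters after each name, as in the question
const withOptionalSpace = /^([a-zA-ZÀ-ú']{2,40}\s*){2,}$/;
// One or more whitespace characters, or the end of the string
const withRequiredSpace = /^([a-zA-ZÀ-ú']{2,40}(\s+|$)){2,}$/;

// With \s*, the engine can split "SemSobrenome" into two runs of letters
// separated by zero spaces, so a single word is accepted
console.log(withOptionalSpace.test('SemSobrenome')); // true
console.log(withRequiredSpace.test('SemSobrenome')); // false
console.log(withRequiredSpace.test('Érica Santos')); // true
```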

Note that when you write an expression as /^etc.../, you already have a RegExp instance, because the slash-delimited form is the literal syntax for creating a regex. You don’t need new RegExp(exp), since exp is already a regex and can be used directly. I also saw that you used string.match(regex), but if you only want to know whether the string matches the regex ("yes" or "no"), you can use regex.test(string): match returns an array with the search results (or null when nothing matches), while test returns only true or false, which seems to be what you need.
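A quick sketch of both points, using the regex from this answer:

```javascript
const exp = /^([a-zA-ZÀ-ú']{2,40}(\s+|$)){2,}$/;

// A regex literal is already a RegExp instance
console.log(exp instanceof RegExp); // true

// test() returns only a boolean
console.log(exp.test('Ítalo José')); // true

// match() returns an array with details of the match, or null when there is none
const found = 'Ítalo José'.match(exp);
console.log(found[0]); // 'Ítalo José'
console.log('SemSobrenome'.match(exp)); // null
```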

I also changed A-z to a-zA-Z. That’s because the range A-z also picks up characters that are not letters, such as [ and \. See the difference:

let s = "a[\\]b";
console.log(s.match(/[A-z]+/g)); // [ 'a[\\]b' ]
console.log(s.match(/[a-zA-Z]+/g)); // [ 'a', 'b' ]


I also removed the g and m flags, because they don’t seem necessary. The g flag fetches all occurrences in a string (by default, only the first occurrence is returned). If the string has more than one name and you want to fetch them all, it makes sense to use it. If the string has only one name, it makes no difference to have g.
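As a small illustration of what g changes, here is a sketch with made-up sample text and a deliberately simple pattern (not the name regex). The last part goes slightly beyond the point above: it shows a well-known side effect of reusing a g regex with test(), which is one more reason to leave the flag out when you only validate a single string:

```javascript
const text = 'Ana Lima, Bia Reis'; // made-up sample data
const pair = /[a-zA-Z]+ [a-zA-Z]+/;
const pairAll = /[a-zA-Z]+ [a-zA-Z]+/g;

// Without g, match() stops at the first occurrence
console.log(text.match(pair)[0]); // 'Ana Lima'
// With g, match() returns every occurrence
console.log(text.match(pairAll)); // [ 'Ana Lima', 'Bia Reis' ]

// With g, test() remembers where the last match ended (lastIndex)
const withG = /^[a-zA-Z]+$/g;
const first = withG.test('Ana');  // true, and lastIndex becomes 3
const second = withG.test('Ana'); // false: the search resumed at index 3
console.log(first, second);
```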

The m flag changes the behavior of the anchors ^ and $. By default they match only the beginning and end of the string, but with the m flag they also match the beginning and end of each line. If you are searching a string that contains one name per line, it makes sense to use it (along with g if you want all occurrences). But if everything is on a single line, you don’t need the flag. For example:

let exp = /^([a-zA-ZÀ-ú']{2,40}( +|$)){2,}$/gm;
let nomes = 'Álvaro Silva\nÍtalo José\nÉrica Santos\nSemSobrenome';
console.log(nomes.match(exp)); // [ 'Álvaro Silva', 'Ítalo José', 'Érica Santos' ]

In the example above, each name is on its own line, and I used the flags m (so that ^ and $ also match the beginning and end of each line) and g (so that match returns all occurrences). I just had to replace \s with a space (note that there is a space before the +), because \s also matches other characters, including line breaks, so the regex was treating the entire string as a single name. Replacing it with a space avoids this problem.
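The effect of \s on a multi-line string can be checked directly. This sketch compares the two separators on the same input (the variable names are mine):

```javascript
const nomes = 'Álvaro Silva\nÍtalo José\nÉrica Santos\nSemSobrenome';

// \s matches line breaks too
console.log(/\s/.test('\n')); // true

// With \s+, a line break counts as a separator between names,
// so the whole string comes back as one single match
const withWhitespace = /^([a-zA-ZÀ-ú']{2,40}(\s+|$)){2,}$/gm;
console.log(nomes.match(withWhitespace).length); // 1

// With a literal space, each line is matched on its own
const withSpace = /^([a-zA-ZÀ-ú']{2,40}( +|$)){2,}$/gm;
console.log(nomes.match(withSpace).length); // 3
```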

Now if the string only has a single name, the flags won’t make a difference.


Another point: you mentioned that the regex should accept accents in the first and last letter. If accents are allowed only in the first and last letter, the regex has to be adapted a little:

let exp = /^([a-zA-ZÀ-ú][a-zA-Z]{0,38}[a-zA-ZÀ-ú]( +|$)){2,}$/gm;
let nomes = 'Álvaro Silva\nÍtalo José\nÉrica Santos\nSemSobrenome\nAcentos Estão no Meio';
console.log(nomes.match(exp)); // [ 'Álvaro Silva', 'Ítalo José', 'Érica Santos' ]

Now accents are allowed only in the first and last letter, and between them only unaccented letters are allowed. I changed the quantifier to {0,38} because you were using {2,40} (at least 2, at most 40); since the first and last letters are now written explicitly, the middle has between 0 and 38 characters.
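To check the arithmetic (1 + {0,38} + 1 gives 2 to 40 characters), here is a sketch that applies the same pieces to a single name only, so the length rule can be tested in isolation. The variable name is just for illustration:

```javascript
// One name only: first letter, 0 to 38 unaccented letters, last letter
const singleName = /^[a-zA-ZÀ-ú][a-zA-Z]{0,38}[a-zA-ZÀ-ú]$/;

console.log(singleName.test('J'));            // false: needs at least 2 letters
console.log(singleName.test('Jó'));           // true: accent in the last letter
console.log(singleName.test('a'.repeat(40))); // true: upper limit
console.log(singleName.test('a'.repeat(41))); // false: one letter too many
console.log(singleName.test('João'));         // false: accent in the middle
```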

Another detail is that the range À-ú includes some characters that are not accented letters, such as Æ and ÷ (DIVISION SIGN), and leaves out ü (which is used in German names, for example). Also, your regex has an apostrophe inside the character class; I don’t know whether that makes sense, so I removed it.
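The contents of the range are easy to verify. In this sketch I wrote the range with \u escapes so the code points are explicit; it is the same range as À-ú. The × check is an extra example of mine, not one mentioned above:

```javascript
// Same range as [À-ú]: U+00C0 (À) to U+00FA (ú)
const range = /^[\u00C0-\u00FA]$/;

console.log(range.test('\u00C6')); // true: Æ is in the range
console.log(range.test('\u00F7')); // true: ÷ (DIVISION SIGN) is in the range
console.log(range.test('\u00D7')); // true: × (MULTIPLICATION SIGN) is in the range too
console.log(range.test('\u00FC')); // false: ü comes after ú, so it is left out
```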

For a regex that handles accents there are other alternatives, and it is worth being aware of possible "oddities" such as Unicode normalization.


Another alternative is to use lookarounds to check whether something exists before or after a given position:

let exp = /^(((?<![a-zA-ZÀ-ü])[a-zA-ZÀ-ü]{2,40}(?![a-zA-ZÀ-ü]))( +|$)){2,}$/gm;
let nomes = 'Álvaro Silva\nÍtalo José\nÉrica Santos\nSemSobrenome\nAcentos Estão no Meio';
console.log(nomes.match(exp)); // [ 'Álvaro Silva', 'Ítalo José', 'Érica Santos', 'Acentos Estão no Meio' ]

The part (?<![a-zA-ZÀ-ü]) is a negative lookbehind, which checks that something does not exist before a given position. In this case, I am checking that there is no letter before it. Similarly, (?![a-zA-ZÀ-ü]) is a negative lookahead, which checks that something does not exist after it.

That is, I look for a run of letters that has no letter before or after it, accented or not (a way to simulate \b that also takes accented letters into account).
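To compare this simulated boundary with the real \b, here is a sketch that extracts words from a made-up sentence. Building the regex from a string is just a convenience to avoid repeating the character class, and it assumes an environment that supports lookbehind:

```javascript
const sentence = 'Álvaro e Érica'; // made-up sample text

// \b treats Á and É as non-word characters, so the first letters are lost
console.log(sentence.match(/\b\w+\b/g)); // [ 'lvaro', 'e', 'rica' ]

// Lookarounds with an explicit letter class find the whole words
const letters = 'a-zA-ZÀ-ü';
const word = new RegExp(`(?<![${letters}])[${letters}]+(?![${letters}])`, 'g');
console.log(sentence.match(word)); // [ 'Álvaro', 'e', 'Érica' ]
```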


Another option (which may not be supported by every browser, so check whether it makes sense in your case) is to use Unicode property escapes:

// accepts accents anywhere in the name
let exp = /^(\p{L}{2,40}( +|$)){2,}$/ugm;
let nomes = 'Álvaro Silva\nÍtalo José\nÉrica Santos\nSemSobrenome\nAcentos Estão no Meio';
console.log(nomes.match(exp)); // [ 'Álvaro Silva', 'Ítalo José', 'Érica Santos', 'Acentos Estão no Meio' ]

// accepts accents only at the beginning and end
exp = /^((\p{L}\p{M}*)\p{L}{0,38}(\p{L}\p{M}*)( +|$)){2,}$/ugm;
console.log(nomes.normalize('NFD').match(exp)); // [ 'Álvaro Silva', 'Ítalo José', 'Érica Santos' ]

Basically, \p{L} matches any letter defined by Unicode, including accented ones. Note that you need the u flag, which enables "Unicode mode".

To accept accents only at the beginning and end, I used normalization. Basically, when a string is normalized to NFD, letters like Á are "broken" in two: the unaccented letter A and the accent itself. So \p{L} matches the letter and \p{M} matches the accents (zero or more, in this case). In the middle of the name I don’t allow accents, so in the end the regex accepts accents only in the first and last letter of each name.
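What NFD does to a single accented letter can be seen in a short sketch. I wrote Á as \u00C1 so that the starting point is certainly the single precomposed character:

```javascript
const nfc = '\u00C1';             // 'Á' as a single precomposed code point
const nfd = nfc.normalize('NFD'); // 'A' followed by U+0301 (COMBINING ACUTE ACCENT)

console.log(nfc.length); // 1
console.log(nfd.length); // 2

console.log(/^\p{L}$/u.test(nfc));      // true: one letter
console.log(/^\p{L}\p{M}$/u.test(nfd)); // true: a letter plus a combining mark

// Normalizing back to NFC recomposes the original character
console.log(nfd.normalize('NFC') === nfc); // true
```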

Only now it has become too broad, because \p{L} covers many other alphabets, such as Japanese, Arabic, Cyrillic, etc. (basically, everything in the Unicode categories that start with "L"). If you want to consider only our alphabet (and the browser supports this feature), you can replace \p{L} with \p{Script=Latin}.
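A minimal sketch of that swap, assuming an environment with Unicode property escape support. The Cyrillic name is a made-up sample, and I left out the g and m flags because each call tests a single name:

```javascript
const anyLetter = /^(\p{L}{2,40}( +|$)){2,}$/u;
const latinOnly = /^(\p{Script=Latin}{2,40}( +|$)){2,}$/u;

const latinName = '\u00C9rica Santos'; // 'Érica Santos'
const cyrillicName = 'Иван Петров';    // made-up sample in Cyrillic

console.log(anyLetter.test(latinName));    // true
console.log(latinOnly.test(latinName));    // true
console.log(anyLetter.test(cyrillicName)); // true: Cyrillic letters are also \p{L}
console.log(latinOnly.test(cyrillicName)); // false: not in the Latin script
```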
