19
How to validate people’s names in Brazilian Portuguese?
24
The Portuguese alphabet is based on the Latin alphabet and consists of 26 letters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
On top of these letters, Brazilian Portuguese uses the following diacritical marks:
~ (Tilde): nasalizes the vowel "a" and the diphthongs "ae", "oe" and "ao": ã / ãe / õe / ão.
¸ (Cedilla): gives the letter "c" the sound of the letter "s" before "a", "o" and "u": ç.
^ (Circumflex accent): marks the stressed syllable and closes the timbre of the vowels "a", "e" and "o" in cases where a written accent is required: â / ê / ô.
´ (Acute accent): marks the stressed syllable and opens the timbre of the vowel in cases where a written accent is required: á / é / í / ó / ú.
` (Grave accent): marks the feminine dative (à), as opposed to "ao" (masculine), and appears in the pronouns "àquele", "àquela" and "àquilo": à.
¨ (Trema): currently used only in Brazilian Portuguese to indicate that the vowel "u" is pronounced in the sequences "qüe", "qüi", "güe" and "güi": ü.
So, beyond the traditional ranges a-z and A-Z, we must also include the characters ãõ, ç, âêô, à, áéíóú and ü. And of course, we can’t forget the space character.
The regex, which matches any character outside that set, would be:
[^a-zA-ZáéíóúàâêôãõüçÁÉÍÓÚÀÂÊÔÃÕÜÇ ]
Also remember to trim the name and collapse repeated spaces before validating:
// Requires: using System; using System.Text.RegularExpressions;
public static string TratarNome(string nome)
{
    if (string.IsNullOrWhiteSpace(nome))
        throw new ArgumentException("Um nome em branco foi passado.");

    // Remove white space at the beginning and at the end of the name:
    nome = nome.Trim();

    // Replace two or more consecutive spaces with a single one:
    nome = Regex.Replace(nome, "[ ]{2,}", " ", RegexOptions.Compiled);

    // Check for characters that are invalid in the Portuguese (Brazilian) alphabet:
    if (Regex.IsMatch(nome, "[^a-zA-ZáéíóúàâêôãõüçÁÉÍÓÚÀÂÊÔÃÕÜÇ ]", RegexOptions.Compiled))
        throw new ArgumentException("Nome inválido: \"" + nome + "\".");

    return nome;
}
I ran the code above against a database of around 100,000 Brazilian names.
Among these, I got the following false positives:
ñ: PEÑA, CAMIÑA, YÁÑEZ, MUÑOZ and MUÑIZ.
': SAINT'CLAIR.
-: SAINT-CLAIR.
In addition to the name of our colleague @jpkrohling:
ö: KRÖHLING.
Another curiosity is that a few records contained an NBSP (code 160) instead of the common space SP (code 32). The validation also detected this (and, in our case, we decided to replace it).
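The NBSP replacement mentioned above could be sketched roughly as follows. This is not the code that was actually used: the NomeEspacos class and the NormalizarEspacos method are hypothetical names, and the sketch assumes the replacement runs before the name is passed to TratarNome.

```csharp
using System.Text.RegularExpressions;

public static class NomeEspacos
{
    // Hypothetical helper: converts NBSP (U+00A0, code 160) into the common
    // space (U+0020, code 32), trims both ends, and collapses runs of spaces.
    public static string NormalizarEspacos(string nome)
    {
        nome = nome.Replace('\u00A0', ' ').Trim();
        return Regex.Replace(nome, " {2,}", " ");
    }
}
```

With this in place, a record stored with NBSPs between the given name and the surname would reach the validation with ordinary single spaces.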
Handling names, especially international ones, is not a simple task. The validation above would reject relatively common names such as Björk and Marić, as well as less common ones such as Graham-Cumming.
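For comparison, a more permissive check could rely on Unicode categories instead of an explicit list of letters. The sketch below is only an illustration under my own assumptions: the NomePermissivo class and the EhValido method are hypothetical, and the choice of allowing the hyphen and both apostrophe forms is mine, not a rule from any standard.

```csharp
using System.Text.RegularExpressions;

public static class NomePermissivo
{
    // Matches any character that is NOT a Unicode letter (\p{L}), a combining
    // mark (\p{M}), a space, an apostrophe (' or U+2019), or a hyphen.
    private static readonly Regex Invalido =
        new Regex(@"[^\p{L}\p{M} '\u2019-]", RegexOptions.Compiled);

    // Hypothetical helper: true when the name is not blank and contains
    // no character matched by the pattern above.
    public static bool EhValido(string nome)
    {
        return !string.IsNullOrWhiteSpace(nome) && !Invalido.IsMatch(nome);
    }
}
```

This accepts Björk, Marić, Graham-Cumming and SAINT'CLAIR, but it also accepts letters from any script, so it is a much weaker filter than the Portuguese-only version.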
Also, when being more permissive, beware of opening the door to an XSS attack. One example is the apostrophe. Some names contain an apostrophe, which is often represented (erroneously?) by the single quote character (') instead of the correct character (’).
Be advised.
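One common way to reduce the XSS risk is to encode the name when it is written into a page, rather than relying only on input validation. The sketch below assumes the name is rendered as HTML text; the NomeSaida class and the ParaHtml method are hypothetical names, while WebUtility.HtmlEncode is the standard .NET method.

```csharp
using System.Net;

public static class NomeSaida
{
    // Hypothetical helper: HTML-encodes a name right before it is written
    // into a page, so characters such as < > & " ' are emitted as entities.
    public static string ParaHtml(string nome)
    {
        return WebUtility.HtmlEncode(nome);
    }
}
```

Encoding at output time means a name containing a single quote can be stored as typed and still be displayed safely in an HTML context.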
+1 for the xkcd, although I see little practical use for this type of validation, since even cities can have names with other characters such as ' (apostrophe) and - (hyphen).
@utluiz, yes, but the idea is to validate people's names. I actually needed to implement this validation in a system yesterday, and I thought it would be interesting to document here what I ended up researching and implementing :)
Another international example that fails is Muñoz and the like. I am curious: why validate only Brazilian names and reject international ones? Wouldn't sanitization against XSS and SQL injection be enough?
@Talles Could you satisfy our curiosity and tell us why the system requires this type of validation? =)
Note that some characters are invalid under the grammar rules but are perfectly valid in names. For example, ö (the one with the trema), which appears in my official documents (RG, CPF, passport).
In this specific system, it is important that the real names of registered users are handled carefully. Because there were many users and integrations with other systems, the decision was made to be rigorous with this registration data. We decided to treat any exceptions as they occur (I can even update the answer when that happens).
Other reasons could be an encoding failure in another integrated system, or a user playing with Unicode characters, for example.
@jpkrohling, interesting, your name would already fail the validation. I once read about a case (a passport being issued) in which the unconventional letter was replaced by the most similar one (ö by o, in your case), but apparently this did not happen to you. By the way, I will see if I can run the validation against the existing names in our database (a good few thousand).
@jpkrohling, would you mind telling us which name or surname of yours has the trema (it does not need to be your full name)? Just so I can add it to the answer.
It’s the last part of my login here, replacing "o" with "ö", so it’s no problem: Kröhling :-)
22
In Brazil there are no restrictions on people's names. The law only states that a name cannot expose the person to ridicule; otherwise anything is allowed. Moreover, nothing prevents a foreigner from living in the country and needing to be registered in your system.
For this reason, it is not enough to provide a validation rule that considers only the letters of the Latin alphabet (A-Z) and their accents. It is also necessary to allow exceptions for apostrophes, hyphens, sequences of Roman numerals (William Gates III, for example), Greek characters... The list of exceptions would be gigantic and would probably still leave something out, generating an error for some specific user.
There is also the problem of encodings, and depending on the treatment that the related systems give to the characters, a name could be printed in another system in a totally incomprehensible way.
In general, if you prevent a user from registering on your system because their name is not accepted, you are losing a potential customer. Validate only that the field has been filled in, and do not take risks.
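As a rough sketch of that recommendation, the whole validation reduces to a blank check. The NomeMinimo class and the FoiPreenchido method are hypothetical names used only for illustration.

```csharp
public static class NomeMinimo
{
    // Hypothetical helper: the only rule is that the field must not be
    // null, empty, or made up solely of white space.
    public static bool FoiPreenchido(string nome)
    {
        return !string.IsNullOrWhiteSpace(nome);
    }
}
```

Names such as "William Gates III" or "Kröhling" pass unchanged, and only an empty field is rejected.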
Yes, in general you have to be careful about being restrictive. But in some (exceptional) scenarios accuracy is more important than quantity (as it was in mine). P.S.: Roman numerals are still letters!
Perhaps he is referring to the sequence of Roman numerals, or more precisely to a run of identical characters, which could be considered an error, although in some ways this goes against the idea (which I agree with) of being less restrictive.
In that same case, should foreign names with non-Latin characters be allowed or not?
In Brazil we also have to worry about indigenous names. Name validation will always be a matter of much discussion and concern for us programmers; I even stopped worrying about it some time ago, haha
– user22610