Regex validation fails even when the input matches the pattern

I’d like to know what’s wrong with this code. Every value I enter for name, email and phone is rejected as invalid, even when it follows the format required by the regex.

import re

class Contato:

    def __init__(self):
        self.__nome = ''
        self.__email = ''
        self.__telefone = ''

    def __validaEmail(self, email):
        result = re.match('(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', str(email))
        if result == None:
            return False

    def __validaTelefone(self, telefone):
        result = re.match('(d{2}) d{4,5}-d{4}', str(telefone))
        if result == None:
            return False

    def inicializaContato(self, nome, email, telefone):
        if self.__validaTelefone(telefone) and self.__validaNome(nome) and self.__validaEmail(email):
            self.__nome = nome
            self.__email = email
            self.telefone = telefone

    def getNome(self):
        return self.__nome
    def getEmail(self):
        return self.__email
    def getTelefone(self):
        return self.__telefone

    def setNome(self, nome):
        str(nome).strip()
        if str(nome).isalpha():
            self.__nome = nome
        else:
            print('\033[1;31m','NOME INVALIDO!','\033[m')

    def setEmail(self, email):
        if self.__validaEmail(email):
            self.__email = email
        else:
            print('\033[1;31m','EMAIL INVALIDO!','\033[m')

    def setTelefone(self, telefone):
        if self.__validaTelefone(telefone):
            self.__telefone = telefone
        else:
            print('\033[1;31m','TELEFONE INVALIDO!','\033[m')
  • And which apparently valid values did you enter that were reported as invalid?

2 answers

1


Let’s look at the method that validates email:

def __validaEmail(self, email):
    result = re.match('(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', str(email))
    if result == None:
        return False

When result is None, you return False, but when it is not None (that is, when the email is valid), you do not return anything. As a result, the function's return value ends up being None. Example:

def f(x):
    if x == 1:
        return True

print(f(2)) # None

If I pass 2, the function does not enter the if and returns nothing, so the above code prints None.

And since None is interpreted as false, if self.__validaEmail(email) ends up testing the same thing as if None, which is always false.

Therefore, you must change the validation to always return the result of match:

def __validaEmail(self, email):
    return re.match(r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', str(email))

def __validaTelefone(self, telefone):
    return re.match(r'(d{2}) d{4,5}-d{4}', str(telefone))

If the regex finds a match, re.match returns the corresponding match object, and when that is tested in an if it is considered true (objects are evaluated as true by default). If there is no match, it returns None, which is considered false.
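As a small illustration of the point above, here is a sketch comparing a validator that forgets to return on success with one that always returns a result. The function names and the digits-only pattern are made up for the example. The second function returns `is not None` so that callers get a real bool, which is a slight variation on returning the match object directly.

```python
import re

def valida_sem_retorno(valor):
    # Mirrors the bug in the question: nothing is returned on success
    result = re.match(r'\d+$', str(valor))
    if result is None:
        return False

def valida_com_retorno(valor):
    # Always returns a bool: True on a match, False otherwise
    return re.match(r'\d+$', str(valor)) is not None

print(valida_sem_retorno('123'))  # None (falsy, so the "if" fails)
print(valida_com_retorno('123'))  # True
print(valida_com_retorno('abc'))  # False
```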

But there is still a problem in the phone regex. You used d{2}, which means "two occurrences of the letter d". If you mean "two digits", you should use the shortcut \d instead, as in \d{2}.

Another detail is that parentheses have a special meaning in regex (they form capture groups). If you want the phone number to contain literal parentheses, you should escape them with \, so they lose their special meaning and are interpreted as the characters ( and ).
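To make both problems visible, the sketch below runs the original phone pattern and a corrected one against the same input. The variable names are only for illustration.

```python
import re

telefone = '(11) 92812-1221'

# Original pattern: "d{2}" is the letter d twice, and ( ) form a capture group
original = r'(d{2}) d{4,5}-d{4}'
# Corrected pattern: \d is a digit, \( and \) are literal parentheses
corrigido = r'\(\d{2}\) \d{4,5}-\d{4}'

print(re.match(original, telefone))               # None
print(bool(re.match(original, 'dd dddd-dddd')))   # True: it matches letters "d"
print(bool(re.match(corrigido, telefone)))        # True
```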

So the validation methods look like this:

def __validaEmail(self, email):
    return re.match(r'(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', str(email))

def __validaTelefone(self, telefone):
    return re.match(r'\(\d{2}\) \d{4,5}-\d{4}', str(telefone))

Testing:

c = Contato()
c.setEmail('[email protected]')
print(c.getEmail())
c.setTelefone('(11) 92812-1221')
print(c.getTelefone())

Output:

[email protected]
(11) 92812-1221
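One thing worth keeping in mind with the phone pattern as written: re.match only anchors at the start of the string, and this pattern has no $ at the end, so trailing text is still accepted. A minimal sketch of the difference, assuming you want the whole string to be a phone number (the constant name and the sample input are made up):

```python
import re

PADRAO_TELEFONE = r'\(\d{2}\) \d{4,5}-\d{4}'

entrada = '(11) 92812-1221 ramal 9'

# re.match only checks that the string STARTS with the pattern
print(bool(re.match(PADRAO_TELEFONE, entrada)))              # True
# re.fullmatch requires the entire string to match
print(bool(re.fullmatch(PADRAO_TELEFONE, entrada)))          # False
print(bool(re.fullmatch(PADRAO_TELEFONE, '(11) 92812-1221')))  # True
```

Adding $ to the end of the pattern and keeping re.match would have a similar effect.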


On the regex

The regexes used above are relatively simple, and can be improved if needed.

The phone regex, for example, accepts values such as (00) 00000-0000. If this is acceptable for your tests, fine, but you can tighten it using the options in this question.

The email regex accepts values such as [email protected]. Of course it can be improved, and I comment more on this here, here, here and here (the last link has some options at the end; I just do not recommend the last regex).

But basically, the simpler the regex, the more "strange" cases it will accept as valid, and the more accurate it is, the more complex it becomes. Choose the one that works best for you, always remembering to balance practicality (is it easy to understand and maintain?) and accuracy (it validates what I want, and rejects what I don’t want).


About the name

The isalpha method does not accept spaces (and I believe you must have tested with names that contain spaces). But since you did not say which values you used to test, we can only speculate a little.
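A quick check of how isalpha behaves, plus a related detail in the original setNome: strings are immutable, so calling strip() without using its return value changes nothing. The sample names are made up.

```python
nome = ' Maria Silva '

print('Maria'.isalpha())        # True
print('Maria Silva'.isalpha())  # False: the space is not a letter

# strip() returns a new string; the original is left untouched
nome.strip()
print(repr(nome))               # ' Maria Silva '
limpo = nome.strip()
print(repr(limpo))              # 'Maria Silva'
```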

We can use something simple like \w+(?: \w+)+. The \w is a shortcut for "letters, digits or the character _". This regex accepts one or more \w, followed by one or more occurrences of "a space followed by one or more \w". That is, names with one or more surnames, all separated by a single space.

Of course, because it is so simple, this regex accepts things like 123 a b_. And here it is the same story as with emails and phones: the more accurate the regex, the more complicated it gets.
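A short sketch of what this simple pattern accepts and rejects. It uses re.fullmatch so the whole string has to match; the sample names are made up.

```python
import re

padrao = r'\w+(?: \w+)+'

print(bool(re.fullmatch(padrao, 'Maria da Silva')))  # True
print(bool(re.fullmatch(padrao, 'José Ângelo')))     # True: \w covers accented letters
print(bool(re.fullmatch(padrao, '123 a b_')))        # True: digits and _ are also \w
print(bool(re.fullmatch(padrao, 'Maria')))           # False: at least two words required
```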

We can eliminate the numbers and the _ and keep only the letters using [^\W\d_]:

  • \W is "anything that is not \w"
  • \d is any numerical digit
  • _ is the very character _

The [^ negates all of this. That is, the result is the same as "\w without the digits and the _", leaving only the letters. I did this because in Python 3 \w matches all letters defined in Unicode, which includes accented characters and other alphabets (such as Japanese, Arabic, Cyrillic, etc.). It would then look like this:

def setNome(self, nome):
    if re.match(r'^[^\W\d_]+(?: [^\W\d_]+)+$', str(nome).strip()):
        self.__nome = nome
    else:
        print('\033[1;31m', 'NOME INVALIDO!', '\033[m')
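To see the effect of [^\W\d_] in isolation, here is a small sketch with the same pattern outside the class. The sample names are made up.

```python
import re

# Letters only (any alphabet), at least two words separated by one space
padrao = r'[^\W\d_]+(?: [^\W\d_]+)+'

print(bool(re.fullmatch(padrao, 'José Ângelo')))   # True
print(bool(re.fullmatch(padrao, 'Иван Петров')))   # True: Cyrillic letters count too
print(bool(re.fullmatch(padrao, '123 a b_')))      # False: digits and _ are excluded
print(bool(re.fullmatch(padrao, 'Maria_ Silva')))  # False
```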

But if you want to limit yourself to the Latin alphabet and accented characters, you can use something like [a-zçáéíóúãõâêô....] (include all the characters you want between brackets). And to avoid having to repeat the uppercase letters, use the flag IGNORECASE:

if re.match(r'^[a-zçáéíóúãõâêô]+(?: [a-zçáéíóúãõâêô]+)+$', str(nome).strip(), flags = re.IGNORECASE):

This regex will still refuse names that contain an apostrophe (like "D'Quino", for example), but I think it’s up to you to decide how much you want to "complicate" the regex and how precise you want it to be.
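If you do decide to accept apostrophes, one possible direction is sketched below. This is only an assumption about what you might want: it allows an apostrophe or hyphen between letters inside a word, and the names letras, palavra and padrao_nome are invented for the example.

```python
import re

letras = 'a-zçáéíóúãõâêô'
# A word: letters, optionally followed by groups of ' or - plus more letters
palavra = rf"[{letras}]+(?:['-][{letras}]+)*"
# A full name: at least two words separated by a single space
padrao_nome = rf"{palavra}(?: {palavra})+"

def nome_valido(nome):
    return re.fullmatch(padrao_nome, str(nome).strip(), flags=re.IGNORECASE) is not None

print(nome_valido("D'Quino Silva"))    # True
print(nome_valido('Ana-Maria Souza'))  # True
print(nome_valido('JOSÉ ÂNGELO'))      # True: IGNORECASE covers the accented letters
print(nome_valido("D' Quino"))         # False: apostrophe must sit between letters
```

Whether this extra complexity is worth it depends on the names you actually expect to receive.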

  • Many thanks for the reply, explanation and references. Grateful.

0

Try using these regexes:

def __validaEmail(self, email):
    result = re.match(r'''(
            [a-zA-Z0-9._%+-]+   # username
            @                   # @ symbol
            [a-zA-Z0-9.-]+      # domain name
            (\.[a-zA-Z]{2,4}){1,2} # dot-something
            )''', str(email), re.VERBOSE)
    if result == None:
        return False

def __validaTelefone(self, telefone):
    result = re.match(r'''(
            (\d{2}|\(\d{2})\)?  # area code
            (\s|-|\.)?          # separator
            (\d{4})             # first 4 digits
            (\s|-|\.)           # separator
            (\d{4})             # last 4 digits
            (\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
            )''', str(telefone), re.VERBOSE)
    if result == None:
        return False
  • To be able to use a multi-line regex with comments, you must pass the VERBOSE flag. But the main problem is not the regex itself; it is the fact that the function does not return anything when result is not None.

  • @hkotsubo, got it. Well, I think the top answer will help you, good luck!
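Both points raised in the comments can be sketched together: a commented multi-line pattern that only works with re.VERBOSE, and a function that always returns a bool. The pattern here is a simplified phone format chosen for the example, not the one from the answer, and the names PADRAO and telefone_valido are hypothetical.

```python
import re

PADRAO = r'''
    \(\d{2}\)   # area code in parentheses
    [ ]         # one literal space
    \d{4,5}     # 4 or 5 digits
    -           # separator
    \d{4}       # last 4 digits
'''

def telefone_valido(telefone, flags=re.VERBOSE):
    # Returns True or False in every case, never None
    return re.fullmatch(PADRAO, str(telefone), flags) is not None

print(telefone_valido('(11) 92812-1221'))           # True
# Without VERBOSE, the whitespace and comments are treated as literal text
print(telefone_valido('(11) 92812-1221', flags=0))  # False
```

In verbose mode, unescaped whitespace is ignored, so the literal space is written as [ ].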
