About your regex:
^[\w-]+@ [a-z\d]+\.[\w]{3}
I don’t know if it was a typo, but notice there’s a space after the @. This makes the expression only validate emails that have a space there (like user@ email.com). So the first thing to do is remove that space.
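To make the effect of that space concrete, here is a minimal sketch comparing the pattern as posted with the same pattern after removing the space. The names with_space and without_space are only illustrative.

```python
import re

# The pattern exactly as posted, with the space after the @
with_space = re.compile(r'^[\w-]+@ [a-z\d]+\.[\w]{3}')
# The same pattern with the space removed
without_space = re.compile(r'^[\w-]+@[a-z\d]+\.[\w]{3}')

print(with_space.match('user@ email.com'))    # a match object
print(with_space.match('user@email.com'))     # None
print(without_space.match('user@email.com'))  # a match object
```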
Another detail is that the shorthand \w matches letters, digits and the character _. In Python 3, by default, it also matches letters defined in Unicode, such as Japanese characters (and those of several other languages), for example:
import re
print(re.match(r'\w+', '鳥山.').group())  # prints 鳥山
If you only want ASCII letters, you can use the ASCII flag (re.ASCII), or simply use [a-zA-Z0-9] in place of \w:
# both print "None", since no match is found
print(re.match(r'\w+', '鳥山.', flags=re.ASCII))
print(re.match(r'[a-zA-Z0-9]+', '鳥山.'))
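One thing worth keeping in mind is that the two alternatives are not exactly equivalent: with the ASCII flag, \w still matches _, while [a-zA-Z0-9] does not. A small sketch (the sample string is made up):

```python
import re

text = 'snake_case'

# \w with re.ASCII: ASCII letters, digits and the underscore
ascii_word = re.match(r'\w+', text, flags=re.ASCII).group()
# Explicit class: ASCII letters and digits only
explicit_class = re.match(r'[a-zA-Z0-9]+', text).group()

print(ascii_word)      # snake_case
print(explicit_class)  # snake
```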
Another detail is that you used [\w-]+, which means "one or more occurrences of \w or -". And since \w also includes the character _, this regex will accept emails such as [email protected].
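As a rough illustration of how permissive [\w-]+ is, the sketch below tests a few made-up local parts (the text before the @) against it. All of them are accepted, including the ones made only of _ and -.

```python
import re

local_part = re.compile(r'[\w-]+')

# Hypothetical local parts; every one of them matches in full
candidates = ['user', 'first_last', '___', '-', '_-_']
for candidate in candidates:
    print(candidate, bool(local_part.fullmatch(candidate)))
```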
Finally, the part after the @ ends with \.[\w]{3}. First of all, [\w] is redundant: \w already represents a set of characters, so putting it between brackets adds nothing (brackets only make sense if you want to combine \w with other things, as you did with [\w-]). So you can switch to simply \w{3}.
But this will only accept domains ending with exactly 3 characters (excluding .io, .br, .info, among many others). And since \w also accepts digits and _, this regex accepts emails such as user@teste._1_. Not to mention that it doesn’t accept emails that end with com.br, for example.
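A quick sketch of these cases, assuming the space after the @ has already been removed. It uses re.fullmatch so that the whole string has to match; with re.match and no end anchor, a string like user@email.info would still match on its user@email.inf prefix. The addresses are made up.

```python
import re

# Original pattern without the space and with \w{3} instead of [\w]{3}
pattern = re.compile(r'[\w-]+@[a-z\d]+\.\w{3}')

results = {
    address: bool(pattern.fullmatch(address))
    for address in [
        'user@email.com',     # accepted
        'user@email.io',      # rejected: only 2 characters after the dot
        'user@email.info',    # rejected: 4 characters after the dot
        'user@email.com.br',  # rejected: more than one dot in the domain
        'user@teste._1_',     # accepted, since \w allows digits and _
    ]
}
for address, ok in results.items():
    print(address, ok)
```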
Then you can replace everything after the @ with something like (?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}. The parentheses (?: and ) form a non-capturing group. I’m basically grouping the sub-expression within them, and the ?: tells the regex engine not to store what was matched (without ?:, what is in parentheses is stored internally and can be obtained from the match afterwards; since I don’t want that, I can say so in the regex itself with ?:).
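The difference between the two kinds of group can be seen with a small sketch. The patterns are simplified to lowercase letters and the domain is made up.

```python
import re

domain = 'mail.example.com'

# Capturing group: the engine stores what the group matched
# (for a repeated group, only the last repetition is kept)
capturing = re.match(r'([a-z]+\.)+[a-z]{2,}', domain)
print(capturing.groups())      # ('example.',)

# Non-capturing group: same matching behaviour, nothing is stored
non_capturing = re.match(r'(?:[a-z]+\.)+[a-z]{2,}', domain)
print(non_capturing.groups())  # ()

# The overall match is the same in both cases
print(capturing.group() == non_capturing.group())  # True
```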
The section above ensures that [a-zA-Z0-9-]+\. (letters, digits or - followed by a dot) is repeated one or more times (indicated by the + after the parentheses). This ensures that we can have emails ending in .com.br, .abc.def.etc.com and so on.
Finally, we have 2 or more letters ([a-zA-Z]{2,}), which ensures that .br and .info (and any other, provided it has at least two letters) are accepted.
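To check the domain part on its own, here is a minimal sketch that applies only that sub-expression to a few made-up domains, using re.fullmatch so that the whole domain must match.

```python
import re

domain_re = re.compile(r'(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}')

accepted = ['email.com', 'email.io', 'email.info',
            'email.com.br', 'abc.def.etc.com']
rejected = ['teste._1_',   # _ and digits are not allowed in the last label
            'localhost',   # no dot at all
            'email.c']     # last label has fewer than 2 letters

for domain in accepted + rejected:
    print(domain, bool(domain_re.fullmatch(domain)))
```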
Another thing I would do is add the anchor $, which marks the end of the string. You’ve already used ^ (start of the string), so using it along with $ ensures that the whole string contains only what is in the expression, and nothing else.
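A short sketch of what the end anchor changes. The names unanchored and anchored and the sample text are only illustrative.

```python
import re

unanchored = re.compile(r'^[\w-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}')
anchored = re.compile(r'^[\w-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}$')

text = 'user@email.com and some trailing text'

# Without $, a valid-looking prefix is enough for a match
print(unanchored.match(text).group())            # user@email.com
# With $, nothing may follow the email
print(anchored.match(text))                      # None
print(anchored.match('user@email.com').group())  # user@email.com
```

Note that in Python $ also matches just before a newline at the very end of the string. If that matters, re.fullmatch or the \Z anchor is a stricter option.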
Anyway, writing a regex that correctly validates 100% of valid emails is very complicated. See this article, for example: it starts with something not too complicated and ends with a monstrous regex.
It is up to you to decide how complicated your regex will be: the more precise it needs to be (the more special cases it supports), the more complicated and difficult it will be to understand and maintain. But if there are special cases you don’t need to handle (such as IP addresses in the domain, or user@localhost, for example), then it doesn’t pay to do something so complicated. Find the balance between accuracy, complexity and practicality (and this varies from one case to another).
I talk a little more about using regex to validate emails here, here and here (the last one has some options at the end; I just don’t recommend the last of them).
Regardless of the regex you choose, the check/filter/sort can be done like this:
import re
r = re.compile(r'^[\w-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}$')
emails = ['[email protected]', 'nao sou email', '[email protected]']
# filter and sort
filtered_emails = sorted(email for email in emails if r.match(email))
print(filtered_emails)
I used comprehension syntax (here, a generator expression passed to sorted), which is much more succinct and pythonic.
The line that creates filtered_emails is equivalent to:
filtered_emails = []
for email in emails:
    if r.match(email):
        filtered_emails.append(email)
filtered_emails.sort()
In both cases, the resulting list is:
['[email protected]', '[email protected]']
To read the quantity, I suggest validating that what was typed is really a number. If not, ask the user to try again (I encapsulated this in a function).
Finally, I also use a list comprehension to read the emails and put them straight into a list.
The complete code:
import re
def le_quantidade():
    while True:
        try:
            return int(input('Number of emails: '))
        except ValueError:
            # if the input is not a number, int() raises a ValueError
            print('Enter a valid number')
n = le_quantidade()
# read the emails and put them in a list
emails = [input('Enter an email: ') for _ in range(n)]
r = re.compile(r'^[\w-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}$')
# filter and sort
filtered_emails = sorted(email for email in emails if r.match(email))
print(filtered_emails)
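If you want to try the filtering step without typing anything, one option is to separate it from the input reading. The sketch below is just one way of doing that: filter_and_sort, EMAIL_RE and the sample addresses are hypothetical names and data, not part of the original code.

```python
import re

EMAIL_RE = re.compile(r'^[\w-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}$')

def filter_and_sort(emails):
    """Return the entries that match EMAIL_RE, sorted alphabetically."""
    return sorted(email for email in emails if EMAIL_RE.match(email))

# Made-up sample data
sample = ['zed@example.org', 'not an email', 'ana@mail.example.com.br']
print(filter_and_sort(sample))
# ['ana@mail.example.com.br', 'zed@example.org']
```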
Please correct the indentation of your code.
– Woss
In Python, indentation is not optional - it’s the syntax. When pasting code here, use the {} button to format while preserving indentation. – jsbueno