Extracting email addresses from a web page using regular expressions in python

Question

Asked 5 years, 6 months ago

What I did:

from urllib.request import urlopen
from re import findall

def emails(url):
    content = urlopen(url).read().decode()

    #print(content)
    padrao = "(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

    mails = findall(padrao, content)
    print(mails)


url = "http://www.cdm.depaul.edu"
emails(url)

The regular expression (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$) appears to be correct, but the program prints an empty list. What is happening?

Well, the answer below already says that the problem was the ^ and the $, which mark the beginning and end of the string (i.e., the regex would only match if the string contained nothing but an email address). As for using regex to validate emails, there is some material here, here, here and here (the last link lists some options at the end; I just don't recommend the last regex).

– hkotsubo

2020/01/20 at 20:06
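
To illustrate the point made in the comment, here is a minimal sketch that runs both patterns against a small made-up string instead of a live page. The sample text and the addresses in it are invented for illustration; the two patterns are the ones from the question and from the answer.

```python
import re

# Hypothetical sample text standing in for the downloaded page content.
content = "Contact alice@example.com or bob@example.org for details."

# Pattern from the question: ^ and $ require the whole string to be one address.
anchored = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"
# Pattern from the answer: no anchors, so it can match anywhere in the text.
unanchored = r"([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"

anchored_hits = re.findall(anchored, content)
unanchored_hits = re.findall(unanchored, content)
# The anchored pattern only succeeds when the string is nothing but an address.
only_email_hits = re.findall(anchored, "alice@example.com")

print(anchored_hits)    # []
print(unanchored_hits)  # ['alice@example.com', 'bob@example.org']
print(only_email_hits)  # ['alice@example.com']
```

In other words, the anchored form is suited to checking whether a single string is an address, while the unanchored form is suited to pulling addresses out of longer text.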

1 answer

by BrunoVdutra • **471** points · Answer 1 · 2020-01-20T19:22:24+00:00

Use the following regular expression:

/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/

so your code becomes:

from urllib.request import urlopen
from re import findall

def emails(url):
    content = urlopen(url).read().decode()

    #print(content)
    # Raw string so that \. reaches the regex engine unchanged; no ^ or $ anchors
    padrao = r"([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"

    mails = findall(padrao, content)
    print(mails)


url = "http://www.cdm.depaul.edu"
emails(url)

Output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']

Anchors in regular expressions match a position before, after, or between characters rather than a character itself. They can be used to "anchor" the regex match to a given position. The caret ^ matches the position before the first character in the string. Applying the regex "^a" to "abc" matches "a". In turn, the regex "^b" does not match "abc" at all, because "b" cannot be matched right after the start of the string, which is the position ^ matches. In your case, ^ and $ required the entire page content to be a single email address, so findall returned an empty list. To test regular expressions interactively, I recommend the following site: https://rubular.com/
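
The anchor behaviour described above can be checked directly in Python. This is only a small sketch: the multi-line text and its addresses are made up, and the re.MULTILINE variant is shown as an aside, not as something the original code used.

```python
import re

# ^a matches because "a" sits right after the start of the string.
start_a = re.findall(r"^a", "abc")
# ^b finds nothing: "b" is not at the start of the string.
start_b = re.findall(r"^b", "abc")

# Hypothetical multi-line text, one made-up address per line.
text = "alice@example.com\nbob@example.org"
pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"

# By default ^ and $ anchor to the whole string, so nothing matches here.
default_hits = re.findall(pattern, text)
# With re.MULTILINE, ^ and $ also match at the start and end of each line.
multiline_hits = re.findall(pattern, text, flags=re.MULTILINE)

print(start_a)         # ['a']
print(start_b)         # []
print(default_hits)    # []
print(multiline_hits)  # ['alice@example.com', 'bob@example.org']
```

Even with re.MULTILINE, the anchored pattern only matches when an address is alone on its line, which is rarely the case in HTML, so dropping the anchors remains the simpler choice for extraction.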