Extracting email addresses from a web page using regular expressions in Python

1

What I did:

from urllib.request import urlopen
from re import findall

def emails(url):
    content = urlopen(url).read().decode()

    #print(content)
    padrao = "(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

    mails = findall(padrao, content)
    print(mails)


url = "http://www.cdm.depaul.edu"
emails(url)

The regular expression (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$) appears to be correct, but the program prints an empty list. What is happening?

  • Well, the answer below already says that the problem was the ^ and the $, which mark the beginning and the end of the string (i.e., the regex would only find something if the string contained nothing but an email address). As for using regex to validate emails, there is some material here, here, here and here (the last link has some options at the end; I just do not recommend the last regex).
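The effect described in this comment can be reproduced without fetching a page. The following is a minimal sketch, assuming a made-up `sample` string in place of the downloaded content; the variable names and addresses are illustrative and not from the original post.

```python
import re

# Hypothetical sample text standing in for the downloaded page content.
sample = "Contact: alice@example.com or bob@example.org for details."

# The question's pattern (with ^ and $) and the same pattern without anchors.
anchored = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
unanchored = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Anchored: the whole string would have to be a single email address.
anchored_hits = re.findall(anchored, sample)
# Anchored pattern on a string that is nothing but an email address.
lone_hits = re.findall(anchored, "alice@example.com")
# Unanchored: addresses can be found anywhere in the text.
unanchored_hits = re.findall(unanchored, sample)

print(anchored_hits)    # []
print(lone_hits)        # ['alice@example.com']
print(unanchored_hits)  # ['alice@example.com', 'bob@example.org']
```

The anchored pattern only succeeds when the input is exactly one address, which is why it returns an empty list for a full HTML page.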

1 answer

3


Use the following regular expression

/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/

so your code becomes:

from urllib.request import urlopen
from re import findall

def emails(url):
    content = urlopen(url).read().decode()

    #print(content)
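    # No ^ or $ anchors, so matches can occur anywhere in the page.
    # The raw string (r"...") keeps the \. escape intact.
    padrao = r"([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"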

    mails = findall(padrao, content)
    print(mails)


url = "http://www.cdm.depaul.edu"
emails(url)

Output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']

Anchors in regular expressions specify a position before, after, or between characters. They can be used to "anchor" a regex match to a given position. The caret (^) matches the position before the first character of the string. Applying the regex "^a" to "abc" matches "a". The regex "^b", in turn, does not match in "abc", because "b" cannot be matched right after the start of the string, which is the position ^ matches. To test regular expressions in an intuitive way, I recommend the following site: https://rubular.com/
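The caret behavior described above can be checked directly in Python. This is a small sketch using `re.search`; the variable names are illustrative only.

```python
import re

# "^a" matches because "a" is the first character of "abc".
start_a = re.search(r"^a", "abc")
# "^b" does not match because "b" is not at the start of the string.
start_b = re.search(r"^b", "abc")
# Without the anchor, "b" can be found anywhere in the string.
plain_b = re.search(r"b", "abc")

print(start_a is not None)  # True
print(start_b is None)      # True
print(plain_b.start())      # 1
```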

  • Could you explain to me what was wrong with the regex?

  • Anchors in regular expressions specify a position before, after, or between characters. They can be used to "anchor" a regex match to a given position. The caret (^) matches the position before the first character of the string. Applying the regex "^a" to "abc" matches "a". The regex "^b", in turn, does not match in "abc", because "b" cannot be matched right after the beginning of the string, which is the position ^ matches. The same logic applies to the $ character, only for the end of the string. I hope that helped. @Eds
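The `$` anchor can be illustrated the same way. The sketch below also shows Python's `re.MULTILINE` flag, which is not mentioned in the thread; it is included only as an assumption about what one might try when anchors are wanted per line. The simplified `\w+@\w+\.\w+` pattern and the sample text are hypothetical.

```python
import re

# "c$" matches because "c" is the last character of "abc".
end_c = re.search(r"c$", "abc")
# "b$" does not match because "b" is not at the end of the string.
end_b = re.search(r"b$", "abc")
# With both anchors, the entire string must be exactly "abc".
both = re.search(r"^abc$", "xabcx")

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
text = "a@b.co\nnot an email\nc@d.org"
per_line = re.findall(r"^\w+@\w+\.\w+$", text, re.MULTILINE)

print(end_c is not None)  # True
print(end_b is None)      # True
print(both is None)       # True
print(per_line)           # ['a@b.co', 'c@d.org']
```

Even with `re.MULTILINE`, an anchored pattern only finds addresses that occupy a whole line, so for HTML content the unanchored pattern from the answer is usually the more practical choice.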
