REGEX to remove websites, emails

I want to remove website addresses, emails, etc...

url_regex = re.compile(r'(?i)(<|\[)?(https?|www)?(.*)?\.(.*){2,4}')
mail_regex = re.compile(r'(?i)(<|\[)?@(.*)\.(.*){2,3}')

This way I could remove for example:

http://www.google.com

http://www.twitter.com

[image.jpeg]

www.facebook.com

[www.amazon.com]

[email protected]

[email protected]

...

When tested on a text, these regexes match the whole text and not just the site/email addresses.

  • 2

    And the question/problem is what exactly?

  • Oh sorry. The problem is that when I test them on a text, these regexes match the whole text. Very strange... I thought the regex made it clear that the match should stop after the "."

1 answer

The problem with these regular expressions is the .* construct. The * quantifier is greedy, i.e., it tries to match as many characters of the string as possible, and . matches any character except a newline.
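To make the greedy behaviour concrete, here is a minimal sketch. The pattern is a simplified, hypothetical stand-in for the ones in the question (not the exact originals), and the sample line is made up:

```python
import re

# Simplified, hypothetical pattern in the spirit of the question's url_regex.
greedy = re.compile(r'www(.*)\.(.*)')

line = "visit www.facebook.com today. Thanks."
match = greedy.search(line)

# The first (.*) runs to the end of the line and only backtracks as far as
# the LAST "." it can find, so the match swallows the rest of the sentence.
print(match.group(0))  # www.facebook.com today. Thanks.
```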

Ideally, whenever possible, you should build a regular expression that has a stopping criterion. For example, can a URL or an email address contain whitespace? If not, the stopping criterion is the first whitespace character. Alternatively, an email or URL can only contain letters, numbers and a few characters (., -, _), so you can match everything until you find a character that is not one of those.
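A small sketch of the two stopping criteria just described, using a made-up sample line. The names until_space and allowed_only are only for illustration:

```python
import re

line = "Write to www.example.com, or call us."

# Stopping criterion 1: stop at the first whitespace character.
until_space = re.compile(r'www\S+')

# Stopping criterion 2: stop at the first character outside an allowed set.
allowed_only = re.compile(r'www[a-z0-9_.-]+', re.IGNORECASE)

by_space = until_space.search(line).group(0)
by_class = allowed_only.search(line).group(0)

print(by_space)  # www.example.com,   (the comma is not whitespace)
print(by_class)  # www.example.com
```

The character class is the stricter of the two. Note that it still allows ".", so a sentence-ending period directly after an address would be included in the match.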

Let's define an email as containing only letters, numbers and a few characters (., -, _), with an @ in the middle. A regular expression that fully validates email addresses is much more complex than that, but this one accepts the vast majority of addresses you will find in practice.

import re
mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)

This regular expression has two parts. The first accepts one or more characters from a to z, digits, and the three special characters we defined. After that we expect an @ character, followed by the second part, which accepts the same characters as the first.
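As a quick check of the email pattern, here is a sketch run on a made-up sentence. The addresses are placeholders, not the ones from the question:

```python
import re

mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)

# Hypothetical sample text; the addresses are placeholders.
text = "Contact alice@example.com or Bob.Smith@mail.example.org for details"

found = mail_regex.findall(text)
replaced = mail_regex.sub('E-MAIL', text)

print(found)     # ['alice@example.com', 'Bob.Smith@mail.example.org']
print(replaced)  # Contact E-MAIL or E-MAIL for details
```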

Matching a URL works the same way; the difference is that our anchor is at the beginning of the match (http://, www or [).

url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+\]?)', re.IGNORECASE)

In this regular expression, we first check whether the match starts with http://, www or [. If so, we then match letters, numbers and the allowed special characters. The only difference here is that we also check whether the last character is ], in case the URL is surrounded by square brackets.
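Here is a sketch that runs the URL pattern against the non-email lines from the question. The https variant at the end is my own assumption about a likely extension, not something the pattern above already handles:

```python
import re

url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+\]?)', re.IGNORECASE)

samples = ["http://www.google.com", "[image.jpeg]",
           "www.facebook.com", "[www.amazon.com]"]

# Each sample line is matched in full by the pattern.
matched = [url_regex.search(s).group(0) for s in samples]
print(matched == samples)  # True

# Limitation: "https://" is not one of the accepted prefixes.
https_hit = url_regex.search("https://example.com")
print(https_hit)  # None

# Assumed extension: make the "s" optional.
url_regex_https = re.compile(r'((https?://|www|\[)[a-z0-9_.-]+\]?)', re.IGNORECASE)
https_match = url_regex_https.search("https://example.com").group(0)
print(https_match)  # https://example.com
```

Neither version matches a path after the host, because "/" is not in the character class.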

Finally, running these expressions on the text you posted gives the following result:

print(mail_regex.sub('E-MAIL', text))
http://www.google.com

http://www.twitter.com

[image.jpeg]

www.facebook.com

[www.amazon.com]

E-MAIL

E-MAIL

And for the URLs:

print(url_regex.sub('URL', text))
URL

URL

URL

URL

URL

[email protected]

[email protected]
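Since the goal in the question is to remove the addresses rather than label them, both substitutions can be combined. The sketch below assumes a hypothetical helper named strip_contacts and a made-up input sentence; it substitutes an empty string and then collapses the leftover spaces:

```python
import re

mail_regex = re.compile(r'[a-z0-9_.-]+@[a-z0-9_.-]+', re.IGNORECASE)
url_regex = re.compile(r'(?:http://|www|\[)[a-z0-9_.-]+\]?', re.IGNORECASE)

def strip_contacts(text):
    """Hypothetical helper: delete e-mails and URLs, then tidy spaces."""
    text = mail_regex.sub('', text)
    text = url_regex.sub('', text)
    # Removing a token leaves two adjacent spaces behind; collapse them.
    return re.sub(r'[ \t]{2,}', ' ', text).strip()

cleaned = strip_contacts("See www.facebook.com or mail alice@example.com now")
print(cleaned)  # See or mail now
```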
