Regular expression (regex) for links to web pages using Python

Question

Asked 8 years, 11 months ago

Viewed 1,231 times

2

I am trying to learn how to build a web crawler. Part of the code needs to extract the links on a web page (only links starting with http or https):

import re   
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

How can I modify this regex, or write a new one, so that it only picks up links that start with http or https? I don't want to keep the word "href", just "http://..." or "https://...". Relative links such as "media/test" or "G1/noticia" are of no use to me.

padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+', html)

This pattern is not 100% functional either; it returns, for example:

http://g1.globo.com/rio

http://g1.globo.com/"

Some results are left with a " at the end, which should not happen!

Which Python version are you using, 2 or 3?

– Miguel

2016/07/31 at 19:19

I use Python 3.4, but I also have Python 2.7 installed, on BackBox Linux.

– Ed S

2016/07/31 at 20:10

2 answers

4

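import urllib, re

# Python 2 code: urllib.urlopen and the print statement are Python 2 only
url = "/q/143677"
html = urllib.urlopen(url).read()

# Capture what follows href=" or href=' when it starts with http:// or https://
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)

for url in urls:
    print url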

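The regular expression matches everything between href=" (or href=') and the next quote, as long as it starts with http:// or https://. The lookbehind and lookahead keep "href=" and the quotes out of the result.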
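To see the pattern in isolation, here is a minimal Python 3 sketch that runs it on a hard-coded string instead of a downloaded page. The sample markup and the names LINK_RE and sample_html are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Hypothetical sample markup, used only for illustration.
sample_html = '''
<a href="http://g1.globo.com/rio">Rio</a>
<a href='https://g1.globo.com/'>Home</a>
<a href="media/test">relative link</a>
<a href="G1/noticia">relative link</a>
'''

# Lookbehind: the match must be preceded by href=" or href='.
# Lookahead: the lazy .+? stops right before the next quote.
LINK_RE = re.compile(r'(?<=href=["\'])https?://.+?(?=["\'])')

urls = LINK_RE.findall(sample_html)
print(urls)
```

The relative links are skipped because nothing after their opening quote matches https?://, and neither "href=" nor the closing quote ends up in the results.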

You can also use a parser for this, for example Beautiful Soup:

import urllib, re
# For version 4.x, use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

url = "/q/143677"
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html)
urls = soup.findAll('a', attrs={'href': re.compile("^https?://")})

for tag in urls:
    print tag['href']
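If installing Beautiful Soup is not an option, the same parser-based idea can be sketched in Python 3 with the standard library's html.parser module. This is a swapped-in alternative, not the code from the answer; the class name LinkCollector and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


# Hypothetical sample markup, used only for illustration.
sample = (
    '<a href="https://example.com/a">A</a>'
    '<a href="media/test">B</a>'
    "<a href='http://example.org/b'>C</a>"
)

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Because the parser hands over attribute values already stripped of their quotes, the trailing-quote problem from the question cannot occur here.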

Shouldn't there be an r before '(?<=href=["\'])https?://.+?(?=["\'])' to make it a raw string?

– Ed S

2016/07/31 at 20:25

@Eds Yes, I forgot to add it. Check whether the first option, the one with the regex, does what you want, because I did not get to test it on other pages.

– stderr

2016/07/31 at 21:08

Thank you. I will test it and get back to you!

– Ed S

2016/07/31 at 23:33

It worked perfectly. Thank you very much!

– Ed S

2016/07/31 at 23:40
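On the raw-string question from the comments: for this particular pattern the r prefix does not appear to change what gets matched, because the only backslashes in it escape quotes. A minimal Python 3 sketch to check that, with a made-up sample string and hypothetical variable names:

```python
import re

# Without r, Python turns \' into a plain ' before the regex engine sees it.
plain = '(?<=href=["\'])https?://.+?(?=["\'])'
# With r, the backslash is kept and the regex engine reads \' as a literal '.
raw = r'(?<=href=["\'])https?://.+?(?=["\'])'

# Hypothetical sample markup, used only for illustration.
sample = '<a href="https://example.com/a">A</a> <a href=\'http://example.org/b\'>B</a>'

same_text = plain == raw
plain_hits = re.findall(plain, sample)
raw_hits = re.findall(raw, sample)
print(same_text, plain_hits == raw_hits)
```

The two strings differ as text but find the same links. The r prefix is still a good habit, since it matters as soon as the pattern contains sequences such as \b or \d.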

by Miguel • **29,306** points · Answer 1 · 2016-07-31T19:57:19+00:00

To complement @zekk's excellent answer, here is a solution for Python 3.x:

import requests, re

url = "/q/143677"
html = requests.get(url).text

urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)
print(urls)