Regular expression (regex) for links to web pages using Python

2

I am trying to learn how to create a web crawler. Part of the code needs to extract the links from a web page (only links starting with http or https):

import re   
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

How can I modify this regex, or write a new one, so that it picks up only links that start with http or https? I don’t want to keep the word "href", just "http://..." or "https://...". Relative links such as "media/test" or "G1/noticia" are of no use to me.

padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+' ,html)

That pattern is also not 100% functional:

http://g1.globo.com/rio

http://g1.globo.com/"

Some matches are left with a " at the end, which should not happen!

  • Which Python are you using, 2 or 3?

  • I use Python 3.4, but I also have Python 2.7 installed (BackBox Linux).

2 answers

5

To complement @zekk’s excellent answer, here’s a solution for Python 3.x:

import requests, re

url = "/q/143677"
html = requests.get(url).text

urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)
print(urls)

4


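# Python 2 (urllib.urlopen and the print statement do not exist in Python 3)
import urllib, re

url = "/q/143677"
html = urllib.urlopen(url).read()

# Lookbehind: must be preceded by href=" or href='
# Lookahead: stop at the first closing quote
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)

for url in urls:
    print url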

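The regular expression matches everything between href=" and the closing quote, as long as it starts with http:// or https://. Because the href=" prefix and the closing quote sit inside lookaround assertions, they are not part of the returned match.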
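
To try the pattern without a network request, here is a minimal sketch for Python 3. The sample markup is made up for illustration (it reuses the URLs from the question), and the capture-group variant is an alternative that was not part of the original answer; it should give the same result here.

```python
import re

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = '''
<a href="http://g1.globo.com/rio">Rio</a>
<a href='https://g1.globo.com/'>Home</a>
<a href="media/test">relative link</a>
<a href="G1/noticia">relative link</a>
'''

# Lookaround version: href=" (or href=') must precede the match and a
# quote must follow it, but neither is included in the result.
LOOKAROUND = re.compile(r'(?<=href=["\'])https?://.+?(?=["\'])')

# Capture-group version (an alternative, not from the answer): findall
# returns only the group, and [^"\']+ stops before the closing quote.
CAPTURE = re.compile(r'href=["\'](https?://[^"\']+)')

lookaround_urls = LOOKAROUND.findall(SAMPLE_HTML)
capture_urls = CAPTURE.findall(SAMPLE_HTML)

print(lookaround_urls)
print(capture_urls)
```

Both patterns skip the relative links and return the URLs without the surrounding quotes, since the quote characters are excluded from the matched text.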

You can also use an HTML parser for this, for example Beautiful Soup:

import urllib, re
# For version 4.x, use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

url = "/q/143677"
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html)
urls = soup.findAll('a', attrs={'href': re.compile("^https?://")})

for tag in urls:
    print tag['href']
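
If installing Beautiful Soup is not an option, the same filtering idea can be sketched with the standard library's html.parser module in Python 3. This is a swapped-in technique, not the Beautiful Soup approach shown above; the names LinkCollector and extract_links and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

# Same filter as in the Beautiful Soup example: keep absolute http(s) URLs.
ABSOLUTE = re.compile(r"^https?://")


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and ABSOLUTE.match(value):
                self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used only for illustration.
sample = (
    '<a href="http://g1.globo.com/rio">Rio</a>'
    '<a href="media/test">relative link</a>'
    '<a href="https://g1.globo.com/">Home</a>'
)
links = extract_links(sample)
print(links)
```

Since the parser hands over attribute values already unquoted, the trailing-quote problem from the question should not come up with this approach.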

  • Shouldn’t there be an r before '(?<=href=["\'])https?://.+?(?=["\'])' to make it a raw string?

  • @Eds Yes, I forgot to add it. See if the first option, the one with the regex, does what you want, because I didn’t get to test it on other pages.

  • Thank you. I will test it and get back to you!

  • It worked perfectly. Thank you very much!
