I am trying to learn how to create a web crawler. Part of the code needs to extract the links from a web page (links starting with http or https):
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
How can I modify this regex, or create a new one, so that it picks up only links that start with http or https? I don't want to keep the word "href", just "http://..." or "https://...". Relative links such as "media/test" or "G1/noticia" are of no use to me.
padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+', html)
I tried the pattern above, but it is also not 100% functional: some matches are left with a " at the end, which was not supposed to happen!
Which Python are you using? 2 or 3?
– Miguel
I use Python 3.4, but I also have Python 2.7 installed, on BackBox Linux.
– Ed S
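One possible approach, offered as a minimal sketch rather than a definitive answer: the second pattern in the question returns "href=" because it has no capture group, and it can swallow the closing quote because the quote characters are inside its character class. Moving `https?://` into the capture group of the first pattern avoids both problems. The sample `html` string and the names `pattern` and `links` below are assumptions made up for illustration; they are not from the question.

```python
import re

# Hypothetical sample input, made up for this example.
html = '''<a href="http://example.com/a">A</a>
<a href='https://example.com/b?x=1'>B</a>
<a href="media/test">C</a>
<a href=G1/noticia>D</a>'''

# "href=" and the optional opening quote are matched but sit outside the group,
# so findall() returns only the URL. The group stops at the first quote,
# space, or ">" character, which keeps the closing quote out of the result.
pattern = re.compile(r'href=[\'"]?(https?://[^\'" >]+)')

links = pattern.findall(html)
print(links)  # ['http://example.com/a', 'https://example.com/b?x=1']
```

The same pattern should behave the same way on Python 2.7 and Python 3.4. For pages with messy markup, a real HTML parser (for example the standard library's `html.parser` on Python 3) is likely to be more robust than a regex.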