Regex to capture links to specific sites from an HTML page

My goal was to make a web crawler that finds links (in many forms, but starting with http/https) in the HTML page of a site. I used requests and regular expressions; part of the code went like this:

import requests
import re

to_crawl = ["https://g1.globo.com"]
crawled = set()

url = to_crawl[0]
code = requests.get(url)
html = code.text
kkk = "href=\"http://google.com\""  # sample string used to test the regex

regex = re.findall(r"href=[\"'](https?:\/\/\w+\.\w+\.?\w+?\/?)", kkk)

print(regex)
print(len(regex))

The goal was a regex that captured these 3 examples:

https://google.com
http://seila.org
https://pt.wikipedia.org/wiki/diretorio1/diretorio2/diretorio3/naoseikkk5

My current goal was to capture at least the first two but, for some reason, my regex returns nothing in those two cases.

Note: I have no idea how to write a regex that also covers the third link, and that is not the focus of the question, but if you can help me with it too I will be very grateful.

  • Do not use regex to manipulate HTML. In your case, an alternative is to use Beautiful Soup: https://answall.com/a/440262/112052

  • Because a regex to capture URLs is much more complicated than it seems: https://stackoverflow.com/q/161738

2 answers

As has already been said here (and also here, here, and mainly here - and in many other places), regex is not the ideal tool for manipulating HTML (read each of those links to understand all the reasons).

In your case, one option is to use a dedicated library, such as Beautiful Soup. With it, it is easy to find all the links on a page:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://g1.globo.com").text  # get the page's HTML (URL from the question)
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):  # href=True skips <a> tags without an href
    print(link['href'])

A regex might work, but if you read the links indicated at the beginning of this answer, you will see that there are many situations a regex cannot handle (or can handle, but becomes so complicated that it is not worth it).

But nothing prevents you from using regex together with Beautiful Soup, since it then operates in a more restricted and controlled environment:

import re

for link in soup.find_all('a', href=re.compile(r'https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)')):
    print(link['href'])

That is, here it is not so problematic to use regex, because I am sure the search is done only in the href attribute of a tags (without the false positives that a regex alone could bring, such as a commented-out tag, a link in another tag, a link in the middle of the JavaScript that came along with the page, text that is not HTML at all, etc.). In this case, I am looking for http or https links that belong to one of the indicated domains (google.com.br, seila.org, or pt.wikipedia.org).
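
To make the difference concrete, here is a small made-up example (the HTML snippet and the naive pattern are hypothetical, just for illustration): a plain regex also captures the link inside an HTML comment, while Beautiful Soup only sees the real tag:

import re
from bs4 import BeautifulSoup

sample = '<a href="http://seila.org">ok</a> <!-- <a href="http://seila.org/old">dead</a> -->'

# a naive regex sees both links, including the commented-out one
print(re.findall(r'href="(https?://[^"]+)"', sample))  # ['http://seila.org', 'http://seila.org/old']

# Beautiful Soup parses the comment as a comment, so only the real tag is found
print([a['href'] for a in BeautifulSoup(sample, 'html.parser').find_all('a')])  # ['http://seila.org']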

But if the idea is to validate URLs, why not use a dedicated lib? You can use, for example, urllib:

from urllib.parse import urlparse
from bs4 import BeautifulSoup

# checks whether a URL is valid (according to the rules we need)
def url_valida(url):
    # <a> tags without an href make Beautiful Soup pass None to this filter
    if not url:
        return False
    try:
        parsed_url = urlparse(url)
        if not parsed_url:
            return False

        # must be http or https, and the hostname must be google.com or seila.org,
        # or, if it is pt.wikipedia.org, the rest of the URL must match (wiki/Diretorio1/etc...)
        return parsed_url.scheme in ('http', 'https') and \
               (parsed_url.hostname in ('google.com', 'seila.org') \
                or (parsed_url.hostname == 'pt.wikipedia.org' and parsed_url.path == '/wiki/Diretorio1/diretorio2/diretorio3/naoseikkk5'))
    except ValueError:
        return False

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=url_valida):
    print(link['href'])

So I check whether the link is http or https, and whether the address is one of the ones I need (google.com or seila.org, or, if it is pt.wikipedia.org, the rest of the URL must be /wiki/Diretorio1/etc...).
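
As a quick sanity check, url_valida can also be called directly; the results below follow from the rules above (the ftp example is made up, just to show a rejected scheme):

print(url_valida('https://google.com'))   # True: https and an allowed hostname
print(url_valida('http://seila.org'))     # True
print(url_valida('ftp://google.com'))     # False: the scheme is not http/https
print(url_valida('https://pt.wikipedia.org/wiki/outra/coisa'))  # False: not the expected path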


For the record, your regex didn't work because the shorthand \w matches any letter, digit, or the underscore character (that is, in practice you were matching anything that merely looks like a URL). But at the end you used \w+?, which uses the lazy quantifier, taking as few characters as possible (read here and here to understand it better). That is, if the URL is http://google.com.br, the regex only captures http://google.com.b - see here.
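
A minimal demonstration of that behavior, using the exact pattern from the question against a made-up sample string:

import re

sample = "href=\"http://google.com.br\""
print(re.findall(r"href=[\"'](https?:\/\/\w+\.\w+\.?\w+?\/?)", sample))
# ['http://google.com.b'] - the lazy \w+? takes a single character and drops the 'r'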

You could even use the regex suggested above (https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)), but as I said, this is very prone to false positives: the tag may be commented out, the link may be "loose" in the text (or be an attribute of another tag, or in the middle of the JavaScript that came along with the page, etc.). A regex only looks at the text itself, without taking its structure into account (the text doesn't even need to be HTML). By using Beautiful Soup (or any other library made for HTML/XML), you get a more reliable way to handle the data (besides being less complicated, since a regex to validate URLs is far from trivial, and if it also has to handle the special cases already mentioned, it becomes ever more complicated and impractical to use).

Not to mention that Beautiful Soup gives you more control over the tags. For example, if you want the text of the a tag, just use link.text. If the a tag has other tags inside it (such as an img, etc.), you can get all of its content with link.decode_contents(), and so on. With regex, you would have to grow the pattern to include these cases, complicating it further. It's not worth it.
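
A small made-up example of that extra control:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://seila.org"><img src="x.png"> Site</a>', 'html.parser')
link = soup.find('a')
print(link['href'])            # http://seila.org
print(link.text)               # only the text: ' Site'
print(link.decode_contents())  # the full inner HTML: '<img src="x.png"/> Site'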

And simplistic solutions can fool you, because they appear to have "worked", as in the example from the other answer. It uses .*?, which is basically "zero or more occurrences of any character", i.e., it accepts anything that comes after http or https until it finds a quote. So it does not restrict the links at all, and the use of . together with the lazy quantifier (see the links already mentioned above) makes the regex extremely inefficient - so much so that the posted example hits a timeout.

Don't get me wrong, regex is cool - I personally like it a lot - but it is not always the best solution.

Hello!

As your goal is to get only the links, without manipulating the rest of the HTML content, I believe the regex below can help you:

\"(http(s?)://.*?)\"

Since you are extracting the links from HTML content (so they are in quotes), and the URL always contains http(s), you should be able to filter the URLs you need.
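
A minimal sketch of how that regex could be applied in Python (the sample HTML is made up):

import re

html = '<a href="https://google.com">x</a> <a href="http://seila.org">y</a>'
# findall returns tuples because the pattern has two groups; the first group is the full URL
links = [m[0] for m in re.findall(r'\"(http(s?)://.*?)\"', html)]
print(links)  # ['https://google.com', 'http://seila.org']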

A very cool place to test regex is this site:

https://regex101.com/r/JkGZrM/2

That link already contains a test with the HTML of the page you are requesting, capturing the links from it.

I hope it helps, hugs!

Edit: Actually, this solution is as simplistic as possible because, as already said in other comments, making a regex that captures every URL is far from simple, even more so without defining some premises first. The intention of the regex above is to give the person who asked the question some way to advance towards their goal, which is to get the links from the content of the page, based only on what has been stated so far. The other answer presupposes some requirements that were never defined and whose real necessity we do not know; sometimes the need is just a simple script that helps reach a final goal, of which this is only an intermediate step, and defining preset, possibly unnecessary, requirements only adds barriers that may not even be part of the scope of the problem.

Ah, and as mentioned, the regex really is very simplistic, since it assumes nothing beyond the URL containing http or https; because of that, given the size of the text being analyzed, the timeout occurs. The default on regex101 is 2 seconds, and for this text the regex takes around 5 seconds to run, so that limit had to be adjusted.

So, for you who asked the question: if your goal is something that will be part of a production system, with better-defined prerequisites than the ones given here, and where performance is an important point, then using Beautiful Soup is surely a better option. But if your goal is just to kill a little ant (a small one-off task) and move forward, that was the purpose of this answer: to offer the regex above as an alternative and, along the way, introduce the regex101 site, which, if you don't already use it, may turn out to be a very useful tool. Hugs, I hope that one of the alternatives shown meets your challenge.

  • I didn't invent any requirements; the question is quite clear: "The goal was a regex that took the 3 examples: https://google.com, http://seila.org and https://pt.wikipedia.org/wiki/diretorio1/diretorio2/diretorio3/naoseikkk5". And I wrote code that handles these three cases, "based only on what has been stated so far". If you analyze it well, you made more assumptions than I did, about the purpose of the code, the needs of the author of the question, etc. :-)

  • You define that all the links the person wants to find are inside an href; that is a prerequisite, no? And that "loose links in the text" should not be counted is another prerequisite, right? I didn't even get into the merits of your solution; it is you who failed to read the regex as an alternative path to the problem, rather than as a solution better or worse than anyone else's. But that your solution has premises that were never stated is a fact.

  • The regex in his code begins with href=, so it is reasonable to assume the links should be inside an href. As for loose links and the other cases, I only mentioned them to point out the difference between using regex and Beautiful Soup (one can bring in that data, the other cannot). At no time did I say it has to be that way because "I want it to be"; I just said the result may differ depending on the solution adopted.

  • As for using regex to manipulate HTML, the links I left at the beginning of the answer indicate why it is not the ideal solution (the first links, in particular, have very detailed explanations about it). That's why I didn't suggest a regex (but I do suggest you read at least that one, and you'll understand why I chose not to: the regex becomes too complicated to be worth it).

  • Thank you very much, Hugo. I don't believe using regex is the best option, as I said in the other comments, but without understanding the requirements and the objective, I see no harm in offering regex as an alternative, precisely because of its ease of use and because it is natively in the language, whereas using a lib, for example, can be a barrier for a beginner who comes across the same problem. Thanks!!!

  • I see your point. But the idea of the site is to be a repository of knowledge about programming, and the answers should be useful not only to the person who asked, but to anyone who visits the site in the future with the same problem. That is why we should answer with this in mind and not limit ourselves to such barriers. If the best solution is a lib that is not native to the language, so be it :-)
