Regex to capture links to specific sites from an HTML page

My goal was to make a web crawler that finds links (in many forms, but starting with http/https) in the HTML page of a site. I used requests and regular expressions; part of the code went like this:

import requests
import re

to_crawl = ["https://g1.globo.com"]
crawled = set()

url = to_crawl[0]
code = requests.get(url)
html = code.text
kkk = "href=\"http://google.com\""  # sample string used to test the regex

regex = re.findall(r"href=[\"'](https?:\/\/\w+\.\w+\.?\w+?\/?)", kkk)

print(regex)
print(len(regex))

The goal was a regex that captured these 3 examples:

https://google.com
http://seila.org
https://pt.wikipedia.org/wiki/diretorio1/diretorio2/diretorio3/naoseikkk5

My current goal was to capture at least the first two but, for some reason, my regex returns nothing in those two cases.

Note: I have no idea how to write a regex that also covers the third link, and that is not the focus of the question, but if you can help me with it too I will be very grateful.

  • Do not use regex to manipulate HTML. In your case, an alternative is to use Beautiful Soup: https://answall.com/a/440262/112052

  • Because a regex to capture URLs is much more complicated than it seems: https://stackoverflow.com/q/161738

2 answers

As has already been said here (and also here, here, and mainly here - and in many other places), regex is not the ideal tool for manipulating HTML (read each of those links to understand all the reasons).

In your case, one option is to use a dedicated library, such as Beautiful Soup. With it, it is easy to find all the links on a page:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://g1.globo.com").text  # get the page's HTML (URL from the question)
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):  # href=True skips <a> tags without an href
    print(link['href'])

A regex might work, but if you read the links indicated at the beginning of this answer, you will see that there are many situations a regex cannot handle (or can handle, but becomes so complicated that it is not worth it).

But nothing prevents you from using regex together with Beautiful Soup, since it then operates in a more restricted and controlled environment:

import re

for link in soup.find_all('a', href=re.compile(r'https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)')):
    print(link['href'])

That is, here it is not so problematic to use regex, because I am sure the search is done only in the href attribute of a tags (without the false positives that a regex alone could bring, such as a commented-out tag, a link in another tag, a link in the middle of the JavaScript that came along with the page, text that is not HTML at all, etc.). In this case, I am looking for http or https links that belong to one of the indicated domains (google.com.br, seila.org, or pt.wikipedia.org).
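
To make the difference concrete, here is a small made-up example (the HTML snippet and the naive pattern are hypothetical, just for illustration): a plain regex also captures the link inside an HTML comment, while Beautiful Soup only sees the real tag:

import re
from bs4 import BeautifulSoup

sample = '<a href="http://seila.org">ok</a> <!-- <a href="http://seila.org/old">dead</a> -->'

# a naive regex sees both links, including the commented-out one
print(re.findall(r'href="(https?://[^"]+)"', sample))  # ['http://seila.org', 'http://seila.org/old']

# Beautiful Soup parses the comment as a comment, so only the real tag is found
print([a['href'] for a in BeautifulSoup(sample, 'html.parser').find_all('a')])  # ['http://seila.org']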

But if the idea is to validate URLs, why not use a dedicated lib? You can use, for example, urllib:

from urllib.parse import urlparse
from bs4 import BeautifulSoup

# checks whether a URL is valid (according to the rules we need)
def url_valida(url):
    # <a> tags without an href make Beautiful Soup pass None to this filter
    if not url:
        return False
    try:
        parsed_url = urlparse(url)
        if not parsed_url:
            return False

        # must be http or https, and the hostname must be google.com or seila.org,
        # or, if it is pt.wikipedia.org, the rest of the URL must match (wiki/Diretorio1/etc...)
        return parsed_url.scheme in ('http', 'https') and \
               (parsed_url.hostname in ('google.com', 'seila.org') \
                or (parsed_url.hostname == 'pt.wikipedia.org' and parsed_url.path == '/wiki/Diretorio1/diretorio2/diretorio3/naoseikkk5'))
    except ValueError:
        return False

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=url_valida):
    print(link['href'])

So I check whether the link is http or https, and whether the address is one of the ones I need (google.com or seila.org, or, if it is pt.wikipedia.org, the rest of the URL must be /wiki/Diretorio1/etc...).
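
As a quick sanity check, url_valida can also be called directly; the results below follow from the rules above (the ftp example is made up, just to show a rejected scheme):

print(url_valida('https://google.com'))   # True: https and an allowed hostname
print(url_valida('http://seila.org'))     # True
print(url_valida('ftp://google.com'))     # False: the scheme is not http/https
print(url_valida('https://pt.wikipedia.org/wiki/outra/coisa'))  # False: not the expected path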


For the record, your regex didn't work because the shorthand \w matches any letter, digit, or the underscore character (that is, in practice you were matching anything that merely looks like a URL). But at the end you used \w+?, which uses the lazy quantifier, taking as few characters as possible (read here and here to understand it better). That is, if the URL is http://google.com.br, the regex only captures http://google.com.b - see here.
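
A minimal demonstration of that behavior, using the exact pattern from the question against a made-up sample string:

import re

sample = "href=\"http://google.com.br\""
print(re.findall(r"href=[\"'](https?:\/\/\w+\.\w+\.?\w+?\/?)", sample))
# ['http://google.com.b'] - the lazy \w+? takes a single character and drops the 'r'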

You could even use the regex suggested above (https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)), but as I said, this is very prone to false positives: the tag may be commented out, the link may be "loose" in the text (or be an attribute of another tag, or in the middle of the JavaScript that came along with the page, etc.). A regex only looks at the text itself, without taking its structure into account (the text doesn't even need to be HTML). By using Beautiful Soup (or any other library made for HTML/XML), you get a more reliable way to handle the data (besides being less complicated, since a regex to validate URLs is far from trivial, and if it also has to handle the special cases already mentioned, it becomes ever more complicated and impractical to use).

Not to mention that Beautiful Soup gives you more control over the tags. For example, if you want the text of the a tag, just use link.text. If the a tag has other tags inside it (such as an img, etc.), you can get all of its content with link.decode_contents(), and so on. With regex, you would have to grow the pattern to include these cases, complicating it further. It's not worth it.
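
A small made-up example of that extra control:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://seila.org"><img src="x.png"> Site</a>', 'html.parser')
link = soup.find('a')
print(link['href'])            # http://seila.org
print(link.text)               # only the text: ' Site'
print(link.decode_contents())  # the full inner HTML: '<img src="x.png"/> Site'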

And simplistic solutions can fool you, because they appear to have "worked", as in the example from the other answer. It uses .*?, which is basically "zero or more occurrences of any character", i.e., it accepts anything that comes after http or https until it finds a quote. So it does not restrict the links at all, and the use of . together with the lazy quantifier (see the links already mentioned above) makes the regex extremely inefficient - so much so that the posted example hits a timeout.

Don't get me wrong, regex is cool - I personally like it a lot - but it is not always the best solution.

Hello!

As your goal is to get only the links, without manipulating the rest of the HTML content, I believe the regex below can help you:

\"(http(s?)://.*?)\"

Since you are extracting the links from HTML content (so they are in quotes), and the URL always contains http(s), you should be able to filter the URLs you need.
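
A minimal sketch of how that regex could be applied in Python (the sample HTML is made up):

import re

html = '<a href="https://google.com">x</a> <a href="http://seila.org">y</a>'
# findall returns tuples because the pattern has two groups; the first group is the full URL
links = [m[0] for m in re.findall(r'\"(http(s?)://.*?)\"', html)]
print(links)  # ['https://google.com', 'http://seila.org']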

A very cool place to test regex is this site:

https://regex101.com/r/JkGZrM/2

That link already contains a test with the HTML of the page you are requesting, capturing the links from it.

I hope it helps, hugs!

Edit: Actually, this solution is as simplistic as possible because, as already said in other comments, making a regex that captures every URL is far from simple, even more so without defining some premises first. The intention of the regex above is to give the person who asked the question some way to advance towards their goal, which is to get the links from the content of the page, based only on what has been stated so far. The other answer presupposes some requirements that were never defined and whose real necessity we do not know; sometimes the need is just a simple script that helps reach a final goal, of which this is only an intermediate step, and defining preset, possibly unnecessary, requirements only adds barriers that may not even be part of the scope of the problem.

Ah, and as mentioned, the regex really is very simplistic, since it assumes nothing beyond the URL containing http or https; because of that, given the size of the text being analyzed, the timeout occurs. The default on regex101 is 2 seconds, and for this text the regex takes around 5 seconds to run, so that limit had to be adjusted.

So, for you who asked the question: if your goal is something that will be part of a production system, with better-defined prerequisites than the ones given here, and where performance is an important point, then using Beautiful Soup is surely a better option. But if your goal is just to kill a little ant (a small one-off task) and move forward, that was the purpose of this answer: to offer the regex above as an alternative and, along the way, introduce the regex101 site, which, if you don't already use it, may turn out to be a very useful tool. Hugs, I hope that one of the alternatives shown meets your challenge.

  • I didn't invent any requirements; the question is quite clear: "The goal was a regex that took the 3 examples: https://google.com, http://seila.org and https://pt.wikipedia.org/wiki/diretorio1/diretorio2/diretorio3/naoseikkk5". And I wrote code that handles these three cases, "based only on what has been stated so far". If you analyze it well, you made more assumptions than I did, about the purpose of the code, the needs of the author of the question, etc. :-)

  • You define that all the links the person wants to find are inside an href; that is a prerequisite, no? And that "loose links in the text" should not be counted is another prerequisite, right? I didn't even get into the merits of your solution; it is you who failed to read the regex as an alternative path to the problem, rather than as a solution better or worse than anyone else's. But that your solution has premises that were never stated is a fact.

  • The regex in his code begins with href=, so it is reasonable to assume the links should be inside an href. As for loose links and the other cases, I only mentioned them to point out the difference between using regex and Beautiful Soup (one can bring in that data, the other cannot). At no time did I say it has to be that way because "I want it to be"; I just said the result may differ depending on the solution adopted.

  • As for using regex to manipulate HTML, the links I left at the beginning of the answer indicate why it is not the ideal solution (the first links, in particular, have very detailed explanations about it). That's why I didn't suggest a regex (but I do suggest you read at least that one, and you'll understand why I chose not to: the regex becomes too complicated to be worth it).

  • Thank you very much, Hugo. I don't believe using regex is the best option, as I said in the other comments, but without understanding the requirements and the objective, I see no harm in offering regex as an alternative, precisely because of its ease of use and because it is natively in the language, whereas using a lib, for example, can be a barrier for a beginner who comes across the same problem. Thanks!!!

  • I see your point. But the idea of the site is to be a repository of knowledge about programming, and the answers should be useful not only to the person who asked, but to anyone who visits the site in the future with the same problem. That is why we should answer with this in mind and not limit ourselves to such barriers. If the best solution is a lib that is not native to the language, so be it :-)
