BeautifulSoup - true href links

I have been studying web scraping with Python and started using the bs4 library (BeautifulSoup). When I began collecting the a tags and their href attribute, I realized I could not access the link when the href held something like:

href="/alguma_pagina.php"

In the case above, I cannot simply request the value "/alguma_pagina.php", because on its own it is not a valid URL.

I need to get the real URL the link leads to when clicked, not just the value that is in the href. How do I get this full URL?

Keep in mind that the base URL may be of the form "url.com.br/", with or without the trailing slash. The href values may be of the form:

"#"
"#alguma_coisa"
"cadastro.php"
"/cadastro.php"
"http://outra_url.com"
"outra_url.com"

and each of these can start or end with a space.
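For reference, the extraction itself looks roughly like this (the HTML string and variable names are only illustrative):

from bs4 import BeautifulSoup

html = '<a href=" /cadastro.php ">Sign up</a> <a href="#topo">Top</a>'
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True):
    print(repr(a['href']))  # raw values, possibly with spaces: ' /cadastro.php ', '#topo'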

  • I believe the problem is in your data processing/validation: at first you manage to capture the URL you want, but it is not formatted the way you want. Have you tried using a regex to validate the possible variations?

  • When I have outraurl.com or qualquercoisa.jsp there is no way to use a regex to determine whether it is a link or a file. I thought of creating a list of all the possible domain suffixes and checking whether the string ends with one of them, but that is very hacky.

  • With a regex you can tell whether it is a URL; it all depends on how you apply the rules, and in my view there is nothing hacky in your case. If you need to perform a very specific validation, that is perfectly normal. What you can do is look for a package that performs the validation more cleanly, to save lines of code (see the sketch after these comments).
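A rough sketch of the kind of regex check discussed above (the pattern is a deliberate simplification, not a complete URL validator):

import re

# Naive rule: treat an href as absolute only if it starts with a scheme.
ABSOLUTE_RE = re.compile(r'^https?://', re.IGNORECASE)

def looks_absolute(href):
    return bool(ABSOLUTE_RE.match(href.strip()))

print(looks_absolute('http://outra_url.com'))  # True
print(looks_absolute('/cadastro.php'))         # False
print(looks_absolute('outra_url.com'))         # False - the ambiguous case from the comments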

1 answer

Whenever you are on a page and a link on it is relative, the real link corresponds to the page's own URL plus the href value.

You can use the urlparse lib (urllib.parse) to do this concatenation cleanly and then make a new request.

However, as you said yourself, sometimes the URL is not relative. Let's handle that case:

import urllib.parse
urllib.parse.urljoin('http://google.com', 'http://ddg.gg')
# 'http://ddg.gg'

In this case, since both URLs are absolute, it always uses the second one, so you can keep a fixed URL as the first argument and vary the second.

Another case is joining, with the same function, an absolute URL and a relative one, for example:

urllib.parse.urljoin('http://ddg.gg/', 'teste.php')

The return would be 'http://ddg.gg/teste.php', which solves the case of relative URLs.

The only case this function does not resolve is when there is no http prefix in the second string, which makes it simply join the two strings:

urllib.parse.urljoin('http://ddg.com/', 'teste.com')

The return would be 'http://ddg.com/teste.com', so it is up to you to know whether the URL is valid or not.

Another option is to use urlparse:

import urllib.parse
urllib.parse.urlparse('teste.com')
# ParseResult(scheme='', netloc='', path='teste.com', params='', query='', fragment='')

This returns a named tuple from which you can check the netloc attribute. If it is empty, the URL is not absolute. This solves the same case as the previous one, although I find the first implementation more Pythonic.
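A small sketch of that netloc check (the helper name is just an example):

import urllib.parse

def is_absolute(url):
    # urlparse only fills netloc (the host part) for absolute URLs;
    # relative paths and schemeless strings leave it empty.
    return bool(urllib.parse.urlparse(url).netloc)

print(is_absolute('http://ddg.gg/teste.php'))  # True
print(is_absolute('teste.php'))                # False
print(is_absolute('teste.com'))                # False - no scheme, so no netloc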

If the URL has an odd value but no http prefix, you fall into that ambiguous case again. What I would recommend: create a list of "badwords", a list of string suffixes, for example ['.com', '.net', '.br', '.de'], and do a simple validation to see whether the string ends with any element of that list. That way you would know it is not a relative path and could use that criterion to decide whether or not to make the request.
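Putting the pieces together, a sketch of that suffix heuristic combined with urljoin (the function name and the suffix list are only illustrative):

import urllib.parse

BAD_SUFFIXES = ['.com', '.net', '.br', '.de']  # extend as needed

def resolve(base, href):
    href = href.strip()                    # hrefs may start or end with spaces
    if not href or href.startswith('#'):   # fragments point to the same page
        return None
    if not urllib.parse.urlparse(href).scheme and \
            any(href.endswith(s) for s in BAD_SUFFIXES):
        return None  # looks like a schemeless domain, not a relative path
    return urllib.parse.urljoin(base, href)

print(resolve('http://url.com.br/', ' /cadastro.php '))  # 'http://url.com.br/cadastro.php'
print(resolve('http://url.com.br/', 'outra_url.com'))    # None - the ambiguous case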
