Whenever you are on a page and find a link related to it, the link's full address is the page's own URL joined with the link itself. You can use the urlparse lib (urllib.parse) to do that join without an ugly string concatenation, and then make a new request.
However, as you said yourself, sometimes the url is not relative. Let’s try to solve this case:
import urllib.parse
urllib.parse.urljoin('http://google.com', 'http://ddg.gg')
In this case, since both URLs are absolute, urljoin always keeps the second one, so you can fix a base URL and just vary the second argument.
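For reference, the return value of the call above (a quick interpreter check):

import urllib.parse

# Since the second argument is already an absolute URL, the base is discarded.
urllib.parse.urljoin('http://google.com', 'http://ddg.gg')
# 'http://ddg.gg'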
Another case is joining an absolute URL with a relative one using the same function, for example:
urllib.parse.urljoin('http://ddg.gg/', 'teste.php')
The return would be 'http://ddg.gg/teste.php', which takes care of the relative URL case.
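A minimal sketch of that pattern, keeping a fixed base and varying the second argument (the link values here are made-up examples, not from the question):

import urllib.parse

base = 'http://ddg.gg/'
# One relative link and one already absolute link, just for illustration.
links = ['teste.php', 'http://google.com/busca']

for link in links:
    # Relative links are resolved against the base; absolute links are kept as-is.
    print(urllib.parse.urljoin(base, link))
# http://ddg.gg/teste.php
# http://google.com/busca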
The only case this function does not resolve is when the second string has no http prefix; then it simply joins the two strings:
urllib.parse.urljoin('http://ddg.com/', 'teste.com')
The return would be 'http://ddg.com/teste.com', and then it is up to you to decide whether that URL is valid or not.
Another option is to use urlparse:
import urllib.parse
urllib.parse.urlparse('teste.com')
# ParseResult(scheme='', netloc='', path='teste.com', params='', query='', fragment='')
This returns a named tuple whose netloc attribute you can inspect. If netloc is empty, the URL is not absolute. This solves the same case as the previous approach, although I find the first implementation more Pythonic.
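A minimal sketch of that check, wrapped in a helper (the function name is mine, not from the original answer):

import urllib.parse

def is_absolute(url):
    # An absolute URL has a netloc (e.g. 'ddg.gg'); a relative one leaves it empty.
    return bool(urllib.parse.urlparse(url).netloc)

is_absolute('http://ddg.gg/teste.php')  # True
is_absolute('teste.com')                # False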
If the URL has some odd value without the http prefix, this check will trip you up again. What I would recommend: create a list of "badwords" containing suffix strings, for example ['.com', '.net', '.br', '.de'], and do a simple validation to see whether any element of that list appears in the string. That way you also know the URL is not relative, and you can use that criterion to decide whether or not to make the request (see the sketch below).
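A sketch of that heuristic, assuming a hand-made suffix list; I check the end of the string with endswith, since the list is described as suffixes (the list itself is illustrative, not exhaustive):

# Illustrative list of suffixes that suggest the string is a domain, not a relative path.
badwords = ['.com', '.net', '.br', '.de']

def looks_like_domain(url):
    # If the string ends with one of the known suffixes, treat it as non-relative.
    return any(url.endswith(suffix) for suffix in badwords)

looks_like_domain('teste.com')  # True
looks_like_domain('teste.php')  # False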
I believe the problem is with your data processing/validation: you are able to capture the URL you want, but it is not formatted the way you want. Have you tried using a regex to validate the possible variations? – RFL
When I have outraurl.com or qualquercoisa.jsp, there is no way for a regex to tell whether it is a link or a file. I thought of creating a vector with all the possible domains and checking whether the string ends with one of them, but that feels like too much of a hack (gambiarra). – Eron Medeiros
With a regex you can tell whether something is a URL; it all depends on how you apply the rules, and in my view there is no gambiarra in your case. If you need to perform a very specific validation, that ends up being normal. What you can do is look for a package that performs a cleaner validation, to save lines of code. – RFL
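For illustration of the regex route suggested in the comments, a rough sketch (my own pattern, not from any package; it only checks for an http/https scheme and a dotted host, so it is far from a complete URL validator):

import re

# Rough pattern: scheme, a host containing at least one dot, then an optional path.
URL_RE = re.compile(r'^https?://[^\s/]+\.[^\s/]+(/.*)?$')

def seems_valid_url(url):
    return bool(URL_RE.match(url))

seems_valid_url('http://ddg.gg/teste.php')  # True
seems_valid_url('teste.com')                # False
seems_valid_url('qualquercoisa.jsp')        # False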