How do I get the final URL of a JS redirect?

I was trying to write some code to get the final URL of a few redirecting links. I managed to do it for most of the links I needed, but for this one I could not: https://redir.lomadee.com/v2/987163d4

All other links worked with urllib2 or requests.

    import requests

    s = requests.Session()
    r = s.get(lili[i], headers=headers)  # lili is my list of links to check
    if lili[i] != r.url:
        print i, r.url

or

    import urllib2

    response = urllib2.urlopen(lili[i])
    if lili[i] != response.geturl():
        print i, response.geturl()

Does anyone know how to solve this? I would rather not use Selenium for it; that is not feasible (it takes too long).

1 answer

Curiously, the strategy this kind of service uses is designed precisely to prevent what you want to do.

Here’s what’s going on: it looks like a redirect, but it is not an HTTP redirect (status 301). When analyzing the body of the response I could (luckily) see what was happening:

    setTimeout(location.href='https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr?utm_term=22696088&utm_campaign=lomadee&utm_medium=afiliados&utm_source=lomadee&lmdsid='+new Date().getTime().toString().slice(8,12)+'29157007', 500);

This is indeed a redirect, but it only fires after the page has reached the client side and the JavaScript has been interpreted, so with requests you cannot see it happen. This serves to "guarantee" that the request was made from a browser.
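
For example, a quick check with requests (a minimal sketch, reusing the URL from the question) confirms that there is no HTTP-level redirect at all; the "redirect" only exists inside the page's JavaScript:

    import requests

    r = requests.get('https://redir.lomadee.com/v2/987163d4')
    print(r.status_code)              # 200 - no 301/302 was returned
    print(r.history)                  # []  - requests followed no redirects
    print('location.href' in r.text)  # True - the redirect lives in the body's JS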

Here’s a workaround to get the URL that this specific service redirects to, using requests (with urllib2 it would be the same thing):

    import requests, re

    req = requests.get('https://redir.lomadee.com/v2/987163d4')
    # capture the URL assigned to location.href in the response body
    redi_url = re.findall(r'(?<=location\.href=["\'])https?://.+?(?=["\'])', req.text)

    if redi_url:
        print(redi_url[0])  # https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr?utm_term=22696088&utm_campaign=lomadee&utm_medium=afiliados&utm_source=lomadee&lmdsid=

I believe colleagues who are better with regular expressions than I am can improve this; in this context regex does not seem like the best approach (the whole body of the response is searched just to reach the setTimeout that does the redirect), so feel free to edit the answer.
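
An alternative sketch with the same limitation (it assumes the single quotes seen in the response body above and is just as tied to this particular redirector) is to skip the regex and read the quoted URL right after the location.href= assignment:

    import requests

    req = requests.get('https://redir.lomadee.com/v2/987163d4')

    # find the JS assignment and take everything up to the closing quote
    marker = "location.href='"
    start = req.text.find(marker)
    if start != -1:
        start += len(marker)
        end = req.text.find("'", start)
        print(req.text[start:end])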

  • That’s basically it (but it will only work for this particular service). In other situations the link information will appear in different contexts. To be compatible with multiple redirectors you would need either an engine that actually runs JS, or a crawler implemented for each vendor (one for lomadee, one for redirector X, etc.), and you would still have to keep updating it whenever a case is detected where the link is not found (which is a sign that that particular redirector has changed its page structure).

  • Exactly @Bacco, Selenium + PhantomJS would give a solution, but the OP stressed that they did not want to take that route. This workaround is fragile, though, and I can only guarantee it works for this source URL.

  • I hope the questioner is doing this precisely to eliminate short-link intermediaries. I suggest that, besides removing the intermediary, he also remove the tracking data from the final link; otherwise he will simply be "eternalizing" the intermediary's partner ID. (I do this in my own systems: whenever possible I strip everything Analytics-related, such as ?utm= and a series of others, from all the links I process.)

  • @Bacco yes, from here it would just be a matter of removing from redi_url everything after the "?"; a split would suffice.

  • @Bacco That’s right, I filter the utm_ parameters that appear in these links, but it is hard to predict because it varies from site to site. Miguel, unfortunately it is not as simple as splitting on the question mark, since that may drop parameters that need to stay, such as the site’s search parameters. Does anyone have tips on deleting only the parameters that set partner cookies? I would like to know more about that :) (one approach is sketched after these comments)

  • @Leo04 it would have to be case by case. A little table relating the source URL to the code that captures the URL, which then also removes whatever is specific to each one, since you have to customize the code anyway. What sucks is that you have to keep doing maintenance afterwards. The best thing is something that tells you when it cannot find the information, because then you have to "fix it fast".

  • I have some services that unfortunately depend on a crawler, because whoever made the source site did not provide an API, so besides capturing what I need, I added code to alert me when something is structurally abnormal. At least when one of the sources changes its code, I find out in time. Relevant detail: it is not about improper consumption, quite the contrary; my crawlers are much lighter than a manual query, which carries a lot of extra stuff (that the end user would have to access anyway).

  • @leo04 what would be the ideal URL to return in this case (which parameters do you need or not need to keep)? Then I can try to adjust the answer. Keep in mind it will only work for this redirector, but it is viable until its structure changes.
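
Following up on the tracking-parameter discussion in the comments, here is a minimal sketch of one approach; the parameters to drop (the utm_* family and lmdsid) are an assumption based on this particular link and would need adjusting per site:

    # Python 3; in Python 2 the same functions live in urlparse / urllib
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def strip_tracking(url, drop_prefixes=('utm_',), drop_names=('lmdsid',)):
        """Remove tracking query parameters, keeping the rest of the URL intact."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if not k.startswith(drop_prefixes) and k not in drop_names]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), parts.fragment))

    url = ('https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr'
           '?utm_term=22696088&utm_campaign=lomadee&utm_medium=afiliados&utm_source=lomadee&lmdsid=')
    print(strip_tracking(url))
    # https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr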
