How to use Scrapy on ASP.NET pages


Well folks, my question is this:

I need to download the Excel file for the product with the description "Maíz", product type "Los Démas. En grano." and marketing "In bulk with 15 % pocketed", from that website.

Using the requests and BeautifulSoup libraries I managed to extract the information from the grid; however, I was not able to do what I actually wanted, which is to click the second button, "Exporta informacion Diária".

Analyzing the Network tab in the browser's DevTools, the page makes two calls: one creates a temporary link, and that temporary link then redirects to the .xls file.

After clicking the download button
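For reference, here is a minimal sketch of how one might replicate that two-call flow with requests. The endpoint URL, the form fields, and the assumption that the response body contains the temporary link are all hypothetical; the real values have to come from your own DevTools capture.

```python
# Minimal sketch of the two-call flow seen in the Network tab.
# The endpoint and payload below are hypothetical placeholders.
import requests

session = requests.Session()

# First call: ask the server to generate the temporary download link.
resp = session.post(
    "https://example.com/ExportaExcel.aspx",   # hypothetical endpoint
    data={"produto": "Maíz"},                  # hypothetical form fields
)
temp_link = resp.text.strip()  # assuming the response body is the link itself

# Second call: follow the temporary link, which redirects to the .xls file.
file_resp = session.get(temp_link, allow_redirects=True)
with open("informacao_diaria.xls", "wb") as f:
    f.write(file_resp.content)
```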

After failing to do what I wanted with the requests and BeautifulSoup libraries, I then moved on to the well-known Scrapy.

I’ve been able to extract data from several pages with Scrapy, but I still haven’t had any success with this ASP.NET page.

Anyway, I just wanted someone to point me in the right direction; no need for code.

Do I have to submit a POST request with the information I want? If so, how could I do that and then download the .xls file?

Thank you in advance.

1 answer



Scrapy is a framework that combines Twisted to download the pages with a parser such as lxml or BeautifulSoup to process them.

Your problem is that the button you want to click probably doesn’t even exist on the page as served! Very likely the pages of this site, like many others on the internet, are delivered incomplete, without all of their elements; those elements are then added to the page by JavaScript code that runs in your browser after it loads.

Therefore, by the time you inspect the page code in your browser, the JavaScript has already run and filled in the elements dynamically, so you do find the button there. Since neither BeautifulSoup nor lxml executes JavaScript, the button simply does not exist on the page they parse in memory when your script runs.
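A quick way to confirm this diagnosis is to fetch the raw HTML (where no JavaScript runs) and check whether the button is present at all. The URL and the button's value text below are placeholders for illustration.

```python
# Fetch the raw, un-rendered HTML and look for the export button.
# The URL and button text are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/pagina.aspx").text
soup = BeautifulSoup(html, "html.parser")

if soup.find("input", value="Exporta informacion Diária") is None:
    print("Button is not in the raw HTML -- it is created by JavaScript.")
else:
    print("Button exists in the raw HTML -- a plain POST may be enough.")
```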

This is very common on today’s web pages, which are quite dynamic. That leaves you with two options:

  1. Analyze the JavaScript code on the page and find out where it creates the button, or what the button does. You can read and follow the JavaScript manually until you find a way to imitate what clicking the button does: which request it sends, which parameters it passes, and so on. Then write Python code to simulate those actions (see the first sketch after this list). It is not an easy task, but the resulting code is very efficient, because it is plain Python that never has to open a real browser, which would be the second option:

  2. Use a real browser, which does run JavaScript. The Selenium library allows you to open and control a real browser window from your script (see the second sketch after this list). Since the page opens in a real browser, the JavaScript works and you can click the button. The downside is that opening a browser is heavy and slow, and it loads various elements and images unnecessary to the process, so it is not as efficient as accessing the source directly.
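For option 1, ASP.NET WebForms pages typically expect a postback: a POST carrying the hidden state fields (`__VIEWSTATE`, `__EVENTVALIDATION`, etc.) plus the name of the control that "clicked". Here is a sketch under that assumption; the URL and the `__EVENTTARGET` control name are hypothetical and must be read from the real form and the Network tab.

```python
# Sketch of option 1: replaying a classic ASP.NET WebForms postback.
# URL and __EVENTTARGET are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/pagina.aspx"  # hypothetical
session = requests.Session()

# Load the page once to collect the hidden state fields the server expects back.
soup = BeautifulSoup(session.get(URL).text, "html.parser")

def hidden(name):
    tag = soup.find("input", {"name": name})
    return tag["value"] if tag else ""

payload = {
    "__VIEWSTATE": hidden("__VIEWSTATE"),
    "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
    "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
    "__EVENTTARGET": "btnExportaDiaria",  # hypothetical control name
    "__EVENTARGUMENT": "",
}

# Replay the postback that the button's JavaScript would have triggered.
resp = session.post(URL, data=payload)
with open("informacao_diaria.xls", "wb") as f:
    f.write(resp.content)
```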
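For option 2, a minimal Selenium sketch could look like the following. The URL and the XPath locator for the button are hypothetical and must be adapted to the real page; the explicit wait matters because the button only appears after the JavaScript has run.

```python
# Sketch of option 2: driving a real browser with Selenium.
# URL and button locator are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/pagina.aspx")  # hypothetical URL
    # Wait until the JavaScript-generated button actually appears...
    button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable(
            (By.XPATH, "//input[contains(@value, 'Exporta')]")
        )
    )
    # ...then click it; the browser follows the redirect and downloads the .xls.
    button.click()
finally:
    driver.quit()
```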
