Scraping data using Robobrowser

I'm trying to scrape a form, insert an attachment, and submit it using RoboBrowser.

To open the page I do:

browser.open('url')

To get the form I do:

form = browser.get_form(id='id_form')

To enter the data into the form I do:

form['data_dia'] = '25'  # for example

To submit the form I do:

browser.submit_form(form, form['btnEnviar'])

or just

browser.submit_form(form)

But this is not working: the form is not being submitted. While trying to fetch all the inputs on the page, I found that the submit button is not being picked up by RoboBrowser.

Running:

todos_inputs = browser.find_all('input')
for t in todos_inputs:
    print(t)

I do not get the input tag with id 'btnEnviar', which in the HTML code of the page is inside the form. The other inputs of the form do come through, such as the day, month and year fields, for example.
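The symptom can be reproduced with plain BeautifulSoup. The snippet below parses a hypothetical, simplified version of such a form (the field names and button id are assumptions mirroring the question): only the inputs present in the raw HTML are found, so a button injected later by JavaScript never shows up.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML as the server might deliver it: the date fields
# are present, but the submit button would only be injected by JavaScript
# after the page loads in a real browser.
html = """
<form id="id_form">
  <input name="data_dia">
  <input name="data_mes">
  <input name="data_ano">
  <!-- <input id="btnEnviar"> is added here by JavaScript at runtime -->
</form>
"""

soup = BeautifulSoup(html, "html.parser")
todos_inputs = soup.find_all("input")
print([i.get("name") for i in todos_inputs])  # only the three date fields
print(soup.find("input", id="btnEnviar"))     # None: the button is missing
```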

I didn't post the HTML code because the page requires a login and password for access.

The problem is that RoboBrowser is only parsing part of the HTML, which prevents me from submitting the form. Is there a solution to this? Or is there another way to fill out and submit a form with tools other than RoboBrowser and BeautifulSoup?

1 answer



RoboBrowser is a module that combines the requests library, to download the pages, with BeautifulSoup, to parse them.

Your problem is that the button you want to click probably does not exist on the page as delivered by the server! Very likely the pages of that site, like many others on the internet, arrive incomplete, without all their elements; the missing elements are then inserted into the page by JavaScript code that runs in your browser after loading.

Therefore, when you inspect the page source in your browser, the JavaScript has already run and filled in the elements dynamically, so you find the button there. Since BeautifulSoup does not run JavaScript, the button simply does not exist on the page it parsed in memory when your script ran.

This is very common on today's web pages, which are quite dynamic. That leaves you with two options:

  • Scan the JavaScript code on the page and find out where it creates the button, or analyze what the button does. You can read and follow the JavaScript manually until you figure out how to imitate a click on this button: which request it sends, which parameters it passes, and so on. Then write Python code to simulate those actions. It is not an easy task, but the resulting code would be very efficient, since it would be pure Python without having to open a real browser, which is the second option:

  • Use a real browser that runs JavaScript. The Selenium library lets you open and control a real browser window from your script. Since the page opens in a browser, the JavaScript runs and you can click the button. The downside is that opening a browser is heavy and slow, and it loads various elements and images unnecessary to the process, so it is not as efficient as accessing the source directly.
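For the first option, a minimal sketch of replaying the button's request directly with requests might look like this. The URL and field names here are placeholders, not the site's real ones; you would discover the actual endpoint and parameters in the browser's network tab (F12) while clicking the button on the live page.

```python
import requests

# Hypothetical endpoint -- replace with the URL the button actually posts to,
# as observed in the browser's developer tools.
FORM_URL = "https://example.com/enviar"

def build_payload(dia, mes, ano):
    """Assemble the form data the JavaScript would send on submit."""
    return {"data_dia": dia, "data_mes": mes, "data_ano": ano}

if __name__ == "__main__":
    with requests.Session() as session:
        # log in first if the site requires it, then replay the submit
        payload = build_payload("25", "12", "2018")
        resposta = session.post(FORM_URL, data=payload)
        print(resposta.status_code)
```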

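For the second option, a minimal Selenium sketch could look like the following. The URL is a placeholder and the field names mirror the question; it assumes selenium and a matching driver are installed (pip install selenium). Because the page opens in a real browser, the JavaScript runs and the 'btnEnviar' button exists by the time you look for it.

```python
# Id of the submit button from the question.
SUBMIT_BUTTON_ID = "btnEnviar"

def main():
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    try:
        driver.get("https://example.com/formulario")  # placeholder URL
        # By now the page's JavaScript has run, so the button is in the DOM
        driver.find_element(By.NAME, "data_dia").send_keys("25")
        driver.find_element(By.ID, SUBMIT_BUTTON_ID).click()
    finally:
        driver.quit()

if __name__ == "__main__":
    main()
```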