I have been tasked with creating an interface optimized for a touch monitor, using data from a website (http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml).
This site lists bus lines and lets you look up their schedules through an Ajax autocomplete field.
As it is a government body, the chance of obtaining the data any other way is almost nil.
I thought of writing a Java or Node.js crawler to hit the request URL, pass the site's parameters (inputs) and filter what I need out of the response. Easy! But only in theory :(
I made a request to this URL:
http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml;jsessionid=1820D695BDE4B916EC808F84BD1B335D
Using these HTTP headers with the webcrawler module for Node.js:
Accept:application/xml, text/xml, */*; q=0.01
Accept-Encoding:gzip, deflate
Accept-Language:pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Content-Length:457
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:JSESSIONID=1820D695BDE4B916EC808F84BD1B335D
Faces-Request:partial/ajax
Host:www.consultas.der.mg.gov.br
Origin:http://www.consultas.der.mg.gov.br
Referer:http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
X-Requested-With:XMLHttpRequest
And the form data below, where I used the number 6 as the autocomplete query, which on the site returns a list:
javax.faces.partial.ajax:true
javax.faces.source:form:tabview:campoBusca
javax.faces.partial.execute:form:tabview:campoBusca
javax.faces.partial.render:form:tabview:campoBusca
form:tabview:campoBusca:form:tabview:campoBusca
form:tabview:campoBusca_query:6
form:form
form:tabview:campoBusca_input:6
form:tabview:campoBusca_hinput:6
form:tabview_activeIndex:0
javax.faces.ViewState:-6275073363975845032:-2043218073946595619
But this was the response:
I also tried in Java, using Jsoup, but that was worse: it returned a lifecycle exception.
I'm stuck. How can I build a working web crawler in this scenario?
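For reference, the round trip described above can be sketched in Python with only the standard library. This is an illustration under assumptions, not a confirmed fix. JSF generally ties `javax.faces.ViewState` (and the `jsessionid` cookie) to server-side state, so replaying values copied from a browser session is a plausible, though unconfirmed, reason for the failed request. The sketch assumes the page is first fetched with a plain GET, that the ViewState sits in a hidden `<input>`, and that the autocomplete response is a JSF `<partial-response>` whose update for `form:tabview:campoBusca` contains `<li>` items. The function names are hypothetical, and the actual HTTP calls (which must share one cookie session) are left out.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape
from html.parser import HTMLParser
from urllib.parse import urlencode

FIELD = "form:tabview:campoBusca"  # autocomplete component id from the form data


def extract_view_state(page_html):
    """Return the javax.faces.ViewState value from a freshly fetched page.

    Assumes the value lives in a hidden <input> tag; returns None if absent.
    """
    for tag in re.findall(r"<input\b[^>]*>", page_html, flags=re.IGNORECASE):
        if 'name="javax.faces.ViewState"' in tag:
            match = re.search(r'value="([^"]*)"', tag)
            if match:
                return unescape(match.group(1))
    return None


def build_autocomplete_body(query, view_state):
    """URL-encode the fields the browser sent, using a fresh ViewState."""
    fields = [
        ("javax.faces.partial.ajax", "true"),
        ("javax.faces.source", FIELD),
        ("javax.faces.partial.execute", FIELD),
        ("javax.faces.partial.render", FIELD),
        (FIELD, FIELD),
        (FIELD + "_query", query),
        ("form", "form"),
        (FIELD + "_input", query),
        (FIELD + "_hinput", query),
        ("form:tabview_activeIndex", "0"),
        ("javax.faces.ViewState", view_state),
    ]
    return urlencode(fields)


class _ItemCollector(HTMLParser):
    """Collect the text of every <li> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def parse_suggestions(partial_response_xml):
    """Return the <li> texts from the update that targets the autocomplete.

    The <li> markup inside the CDATA section is an assumption about the
    response, not something confirmed against the live site.
    """
    root = ET.fromstring(partial_response_xml)
    collector = _ItemCollector()
    for update in root.iter("update"):
        if update.get("id") == FIELD:
            collector.feed(update.text or "")
    collector.close()
    return [item.strip() for item in collector.items]
```

The intended flow would be: GET the page, call `extract_view_state` on the HTML, POST the result of `build_autocomplete_body("6", view_state)` with the `Faces-Request: partial/ajax` header and the cookie from that same GET, then pass the response text to `parse_suggestions`.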
After posting the bounty I noticed the spelling error: *cited.
– Eric Silva
Have you tried using Selenium? I can't speak for Java or Node, but it has a good interface for Python and R. Since it automates operations in the browser itself, it should work.
– Daniel Falbel
Thank you, Daniel. I installed a Firefox plugin to test Selenium, and I saw that it has an API for Java. What would the data extraction look like in Python? Can you give an example?
– Eric Silva
I also need to build a DER crawler. Has there been any progress on this problem? It would be a great help!
– Rodrigo Brito
Rodrigo, I took a different path. Since my requirement was to build a query terminal by adapting the interface for touch, I simply modified the page at runtime using an extension to create macros emacs6 in Chrome. It is working well.
– Eric Silva