Web crawler (spider) for a JSF page with AJAX, using Node.js or the Jsoup API in Java

2

I have been tasked with creating an interface optimized for touch screens, using data from a website (http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml).

This site lists bus lines and lets you look up their schedules through an AJAX autocomplete field.

Since it is run by a government body, the chance of obtaining the data any other way is almost nil.

I thought of writing a crawler in Java or Node.js that would call the request URL, send the site's parameters (inputs), and filter what I need out of the response. Easy! But only in theory :(

I made a request to this URL:

http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml;jsessionid=1820D695BDE4B916EC808F84BD1B335D

Using these HTTP headers with the webcrawler module for Node.js:

Accept:application/xml, text/xml, */*; q=0.01
Accept-Encoding:gzip, deflate
Accept-Language:pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Content-Length:457
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:JSESSIONID=1820D695BDE4B916EC808F84BD1B335D
Faces-Request:partial/ajax
Host:www.consultas.der.mg.gov.br
Origin:http://www.consultas.der.mg.gov.br
Referer:http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
X-Requested-With:XMLHttpRequest

And the form data below, where I used the number 6 as the autocomplete query, which on the site returns a list:

javax.faces.partial.ajax:true
javax.faces.source:form:tabview:campoBusca
javax.faces.partial.execute:form:tabview:campoBusca
javax.faces.partial.render:form:tabview:campoBusca
form:tabview:campoBusca:form:tabview:campoBusca
form:tabview:campoBusca_query:6
form:form
form:tabview:campoBusca_input:6
form:tabview:campoBusca_hinput:6
form:tabview_activeIndex:0
javax.faces.ViewState:-6275073363975845032:-2043218073946595619

But this was the response:

[image not available]

I also tried in Java, using Jsoup, but that was worse: it returned a lifecycle exception.

I'm stuck. How can I build a working web crawler in this scenario?

  • After posting the bounty I noticed a spelling mistake in my Portuguese: *cited.

  • Have you tried using Selenium? I don't know about Java or Node, but it has a good interface for Python and R. Since it automates operations in the browser itself, it should work.

  • Thank you, Daniel. I got a Firefox plugin to test Selenium, and I saw that it has an API for Java. What would data extraction look like with Python? Can you give an example?

  • I also need to build a crawler for DER. Has there been any progress on this problem? It would be a great help!

  • Rodrigo, I took another path. Since my requirement was a query terminal with the interface adjusted for touch, I simply modified the page at runtime using an extension to create macros emacs6 in Chrome. It is working well.

1 answer

9


Crawling JSF applications is virtually impossible for a simple reason: in 99.9% of cases, JSF is stateful.

This means you cannot send an arbitrary request to the site; if you do, you get the lifecycle error. The system stores information in the session, and that information is not there when the request is made outside the browser.
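To make the statefulness concrete: the `javax.faces.ViewState` value in the question's form data is issued by the server together with the page, so a value copied from an old browser session generally stops being valid. The sketch below, in Python with only the standard library, reads that hidden field out of freshly fetched HTML. The names `ViewStateParser` and `extract_view_state` and the sample fragment are made up for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser


class ViewStateParser(HTMLParser):
    """Collects the value of the hidden javax.faces.ViewState input, if any."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first matching <input>; ignore every other tag.
        if tag != "input" or self.view_state is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "javax.faces.ViewState":
            self.view_state = attributes.get("value")


def extract_view_state(html):
    """Return the ViewState token found in the page, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


# Hypothetical fragment, shaped like what a JSF form usually renders.
sample = (
    '<form id="form" method="post">'
    '<input type="hidden" name="javax.faces.ViewState" '
    'id="javax.faces.ViewState" value="-123:456" autocomplete="off" />'
    "</form>"
)
print(extract_view_state(sample))
```

A `None` result would suggest the page was not served as a normal JSF view (for example an error page), which is worth checking before sending any AJAX request.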

I don't have a solution for this particular case, but technically you would need to simulate actual use of the system, not just the final request. This is easier if you go through the site manually and monitor the requests with the browser's developer tools. Much like what you already have, but including the earlier steps and sending the session ID on every request.
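As a rough illustration of "previous steps plus the session ID on every request", here is a minimal Python sketch using only the standard library. The URL and field names are copied from the question; the function names are invented here, the header set is reduced to three of the headers the question lists, and nothing guarantees that this particular site accepts the request without further steps.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from http.cookiejar import CookieJar

PAGE_URL = "http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml"
FIELD = "form:tabview:campoBusca"  # autocomplete field id from the question


def new_session():
    """Opener that keeps cookies (JSESSIONID) across requests, like a browser."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def fetch_page(opener):
    """Step 1: GET the page so the server creates the session and the view."""
    with opener.open(PAGE_URL) as response:
        return response.read().decode("utf-8", errors="replace")


def build_ajax_body(view_state, query):
    """Form-encode the partial/ajax fields shown in the question."""
    fields = [
        ("javax.faces.partial.ajax", "true"),
        ("javax.faces.source", FIELD),
        ("javax.faces.partial.execute", FIELD),
        ("javax.faces.partial.render", FIELD),
        (FIELD, FIELD),
        (FIELD + "_query", query),
        ("form", "form"),
        (FIELD + "_input", query),
        (FIELD + "_hinput", query),
        ("form:tabview_activeIndex", "0"),
        ("javax.faces.ViewState", view_state),
    ]
    return urllib.parse.urlencode(fields).encode("utf-8")


def parse_partial_response(xml_data):
    """Map each <update id="..."> of a JSF partial-response to its markup."""
    root = ET.fromstring(xml_data)
    return {update.get("id"): (update.text or "") for update in root.iter("update")}


def autocomplete(opener, view_state, query):
    """Step 2: POST the AJAX request with the same opener (same cookies)."""
    request = urllib.request.Request(
        PAGE_URL,
        data=build_ajax_body(view_state, query),
        headers={
            "Faces-Request": "partial/ajax",
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        },
    )
    with opener.open(request) as response:
        return parse_partial_response(response.read())
```

The intended order is: create one opener with `new_session()`, call `fetch_page(opener)`, read the hidden `javax.faces.ViewState` value from the returned HTML, then call `autocomplete(opener, view_state, "6")` with that same opener. Reusing the opener is what keeps the session cookie and the view state consistent with each other.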

Note: the fact that data is available on the Internet does not mean you can simply capture it and put it on your own site. Always check the data usage rights.
