Differences between RCurl, httr (R) and requests (Python) when making a POST

I want to access the page you reach by clicking "Displays all the above documents" at the link used in the code below. The company I picked is just an example; I have no interest in it.

I tried to solve this with a POST request and got the result I wanted using Python's requests library. The Python code is below:

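import requests

link = "http://siteempresas.bovespa.com.br/consbov/ExibeTodosDocumentosCVM.asp?CNPJ=02.541.982/0001-54&CCVM=22551&TipoDoc=C&QtLinks=10"
# Initial GET to obtain the cookies set by the page
r = requests.get(link)
# Form fields to submit
dados = {'hdnCategoria': '0', 'hdnPagina': '', 'FechaI': '', 'FechaV': ''}
# POST the fields (URL-encoded by default), passing along the cookies from the GET
r1 = requests.post(link, data=dados, cookies=r.cookies)
print(r1.text)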

I then tried to run the following code in R, first using RCurl:

library(RCurl)
link <- "http://siteempresas.bovespa.com.br/consbov/ExibeTodosDocumentosCVM.asp?CNPJ=02.541.982/0001-54&CCVM=22551&TipoDoc=C&QtLinks=10"
curl <- getCurlHandle()
r <- getURL(link, curl=curl)
r1 <- postForm(link, hdnCategoria='0', hdnPagina='', FechaI='', FechaV='', .encoding='UTF-8', curl=curl)
cat(r1)

and then using httr (which, as I understand it, is just a wrapper around RCurl):

library(httr)
link <- "http://siteempresas.bovespa.com.br/consbov/ExibeTodosDocumentosCVM.asp?CNPJ=02.541.982/0001-54&CCVM=22551&TipoDoc=C&QtLinks=10"
h <- handle(link)
dados <- list(hdnCategoria='0', hdnPagina='', FechaI='', FechaV='')
r1 <- POST(handle=h, body=dados, encoding='UTF-8')
cat(content(r1, 'text'))

a) Why do the two R alternatives return the original page rather than the result of clicking "Displays all the above documents"?

b) What does the Python library do in addition that makes it work so simply?

PS: For this question I would rather not use mechanize, selenium, or other Python libraries. I want to solve it in R, preferably with httr or, failing that, with RCurl. There is also a newer alternative, rvest, but I don't know it well and am not sure it makes sense for this specific case.

  • I don't know R, but from what I understand of your code, the main difference is that the Python example passes the cookies obtained in the first GET to the POST, which the other snippets do not seem to do. This matters because many websites require a secret token in the POST to protect against CSRF, or apply some similar measure. It is not obvious to me what this particular site is doing, but cookies with random values are being sent in the second request. You need to figure out how to do the same in R.

  • Ignore the previous comment - even after clearing cookies in the browser, it still loads the page normally. The problem is not there...

  • @mgibsonbr apparently the functions with "Handle" in their name exist precisely to carry cookies across requests

1 answer

Interestingly, I was able to solve the problem by simplifying the httr code. It seems the package has been updated and POST now accepts an encode parameter, which can be multipart (the default), form (what I want here), or json.
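
To make the difference between the form and multipart encodings concrete, here is a minimal sketch using only Python's standard library. It builds both request bodies for the fields from the question without contacting the server. The boundary string is an arbitrary, hypothetical value chosen for illustration, and the idea that this particular server only parses the URL-encoded variant is an inference from the fact that encode='form' worked, not something confirmed.

```python
from urllib.parse import urlencode

# Same fields as in the question.
dados = {'hdnCategoria': '0', 'hdnPagina': '', 'FechaI': '', 'FechaV': ''}

# "form" encoding: a single URL-encoded line, sent with
# Content-Type: application/x-www-form-urlencoded
form_body = urlencode(dados)

# "multipart" encoding: one part per field, sent with
# Content-Type: multipart/form-data; boundary=...
boundary = "illustrative-boundary"  # arbitrary value, for illustration only
parts = []
for name, value in dados.items():
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n'
        f"\r\n"
        f"{value}\r\n"
    )
multipart_body = "".join(parts) + f"--{boundary}--\r\n"

print(form_body)
print(multipart_body)
```

The two bodies carry the same fields but look completely different on the wire, so a server that only reads URL-encoded form data would see an empty form when it receives the multipart version.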

In addition, httr already keeps cookies across requests by default. The code below worked:

library(httr)
link <- "http://siteempresas.bovespa.com.br/consbov/ExibeTodosDocumentosCVM.asp?CNPJ=02.541.982/0001-54&CCVM=22551&TipoDoc=C&QtLinks=10"
# GET first; httr keeps any cookies for later requests to the same host
aux <- GET(link)
dados <- list(hdnCategoria='0', hdnPagina='', FechaI='', FechaV='')
# encode='form' sends the body as application/x-www-form-urlencoded
r1 <- POST(link, body=dados, encode='form')
cat(content(r1, 'text'))
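
For comparison, the same GET-then-POST pattern with automatic cookie handling can be sketched in Python using only the standard library. This is an illustrative sketch, not the method used in the question: make_session and fetch_all_documents are hypothetical helper names, and the functions are only defined here, not called, so nothing is sent to the server.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_all_documents(link, fields):
    """GET the page to collect cookies, then POST the URL-encoded form fields."""
    opener, _jar = make_session()
    # First request: any cookies the server sets end up in the jar.
    opener.open(link).read()
    # Passing data= turns the request into a POST; urllib sends it as
    # application/x-www-form-urlencoded by default.
    body = urlencode(fields).encode("ascii")
    with opener.open(link, data=body) as resp:
        return resp.read()  # raw bytes; decoding depends on the page's charset
```

Calling fetch_all_documents(link, dados) with the link and fields from the question would then perform both requests through the same cookie jar.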
