I would like to scrape a page served over HTTPS using the rvest package. However, the website has a problem with its security certificate. In such cases you need to turn off SSL verification, but I don't know how to do this with rvest. With RCurl and httr it is very simple; I give some examples below.
This is the page I intend to scrape:
sucupira = "https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"
This is what I am trying to do:
library(rvest)
read_html(sucupira) # DOES NOT WORK
## Error in open.connection(x, "rb") :
## Peer certificate cannot be authenticated with given CA certificates
Obviously, just removing the "s" from https does not work:
sucupira2 = "http://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"
read_html(sucupira2) # STILL DOES NOT WORK
With RCurl, a successful attempt looks like this:
library(RCurl)
getURL(sucupira) # DOES NOT WORK
options(RCurlOptions =
          list(capath = system.file("CurlSSL",
                                    "cacert.pem",
                                    package = "RCurl"),
               ssl.verifypeer = FALSE))
getURL(sucupira) # NOW IT WORKS
With httr, it looks like this:
library(httr)
GET(sucupira) # DOES NOT WORK
set_config(config(ssl_verifypeer = 0L))
GET(sucupira) # NOW IT WORKS
My goal is to learn how to use rvest, so I would prefer, if possible, not to use:
read_html(GET(sucupira)) # the response from httr's GET is
                         # passed to rvest's read_html
You may consider that "accessing a page" (httr) is a different task from "manipulating an HTML file" (rvest). My web scrapers usually contain the sequence httr::GET(x) %>% httr::content('text') %>% xml2::read_html() %>% rvest::html_XXX()
– Julio Trecenti
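
The split described in the comment above could be sketched roughly as follows, with certificate verification turned off only for the single request rather than globally. The helper name `fetch_html_insecure` is hypothetical and is not part of httr, xml2, or rvest; the `ssl_verifypeer = 0L` option is the same one the question passes to `set_config()`. This is a minimal sketch under those assumptions, not a recommended default, since disabling verification removes the protection HTTPS is meant to provide.

```r
# Hypothetical helper (not from the original post): fetch a page with
# SSL peer verification disabled for this one request, then parse it.
fetch_html_insecure <- function(url) {
  # "Access the page" with httr; the config applies only inside with_config()
  resp <- httr::with_config(
    httr::config(ssl_verifypeer = 0L),
    httr::GET(url)
  )
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses

  # "Manipulate the HTML" with xml2/rvest: parse the body text into a document
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  xml2::read_html(body)
}

# Usage (requires network access and the httr, xml2 and rvest packages):
# page <- fetch_html_insecure(sucupira)
# rvest::html_nodes(page, "title")
```

Scoping the option with `with_config()` instead of `set_config()` keeps other requests in the same R session verified as usual.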