How to scrape an HTTPS page using rvest?

Asked 9 years, 8 months ago

I would like to scrape a page served over HTTPS using the rvest package. However, the website has a problem with its security certificate. In such cases you need to turn off SSL verification, but I don't know how to do that in this package. In RCurl and httr it is very simple; I give some examples below.

That’s the page I intend to scrape:

sucupira = "https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"

That’s what I’m trying to do:

library(rvest)
read_html(sucupira) # does not work
 ##  Error in open.connection(x, "rb") : 
 ##  Peer certificate cannot be authenticated with given CA certificates

Obviously, just removing the "s" from https does not work:

sucupira2 = "http://sucupira.capes.gov.br/sucupira/public/consultas/coleta/producaoIntelectual/listaProducaoIntelectual.jsf"

read_html(sucupira2) # still does not work

With RCurl, a successful attempt looks like this:

library(RCurl)
getURL(sucupira) # does not work

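# cainfo points at a CA bundle file (capath expects a directory);
# ssl.verifypeer = FALSE is what actually disables the certificate check
options(RCurlOptions = 
      list(cainfo = system.file("CurlSSL", 
                                "cacert.pem", 
                                package = "RCurl"), 
           ssl.verifypeer = FALSE))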

getURL(sucupira) # now it works

With httr it looks like this:

library(httr)
GET(sucupira) # does not work

set_config(config(ssl_verifypeer = 0L))
GET(sucupira) # now it works

My goal is to learn how to use rvest, so if possible I would rather not use:

read_html(GET(sucupira)) # the response from httr's GET is
                         # passed to rvest's read_html

You may want to consider that "accessing a page" (httr) is a different task from "manipulating an HTML file" (rvest). My web scrapers usually contain the sequence httr::GET(x) %>% httr::content('text') %>% xml2::read_html() %>% rvest::html_XXX()

– Julio Trecenti

2015/12/17 at 00:30

2 answers

This does not seem possible using the rvest package alone.

Reading the source code, we see that read_html is a wrapper around read_xml. The source code is available at this link.

The function read_xml dispatches to a method depending on the type of its input, which may be character, raw or connection.

When we pass a URL to read_xml, it converts it to a connection and then reads it as raw.

Below is the read_xml method for connections:

read_xml.connection <- function(x, encoding = "", n = 64 * 1024,
                                verbose = FALSE, ..., base_url = "",
                                as_html = FALSE) {
  if (!isOpen(x)) {
    open(x, "rb")
    on.exit(close(x))
  }

  raw <- read_connection_(x, n)
  read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html)
}

Note that it uses the open function from base R.

The help page for open says:

Note that the https:// URL scheme is not supported by the internal method except on Windows. There it is only supported if --internet2 or setInternet2(TRUE) was used (to make use of Windows internal functions), and then only if the certificate is considered to be valid. With that option only, the http://user:pass@site notation for sites requiring authentication is also accepted.

That is, with the internal method https is only supported on Windows, and only if setInternet2(TRUE) was called beforehand. Even then, it would only work if the certificate were valid.

All this is to explain that rvest has no native way, or simple change of argument, that allows you to read https pages.

I believe the best method really is read_html(GET(sucupira)), which you suggested yourself. Or, more elegantly:

GET(sucupira) %>% read_html()
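
A minimal sketch of that pipeline, with certificate verification switched off for a single request instead of globally. The helper name read_html_insecure is hypothetical, and the inline HTML string is only a stand-in for a downloaded page so that the parsing step can be tried offline. It assumes httr and rvest are installed.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page while skipping certificate verification
# for this request only, then hand the response body to read_html().
read_html_insecure <- function(url) {
  resp <- GET(url, config(ssl_verifypeer = 0L))
  stop_for_status(resp)  # fail early on HTTP errors
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access):
# page <- read_html_insecure(sucupira)

# The rvest part works the same on any HTML text, here an inline stand-in:
page <- read_html("<html><body><h1>Intellectual production</h1></body></html>")
title <- html_text(html_node(page, "h1"))
```

Passing the config to GET() keeps the relaxed check local to this call, so other requests in the same R session still verify certificates.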

If, in the read_xml.connection method, you changed the line open(x, "rb") to url(x, "rb", method = "libcurl"), it would probably work...

by momenezes • 11 points · Answer 1 · 2020-12-21T16:29:05+00:00

Hello,

Although a lot of time has passed: disabling the certificate check in httr does in fact also take effect in rvest.

This is what I did:

library(httr)
library(rvest)

set_config(config(ssl_verifypeer = 0L))

# now access the site
# (html_session() was renamed session() in rvest 1.0.0)
url <- 'https://cei.b3.com.br/CEI_Responsivo/'
sessao <- html_session(url)

This way rvest did not complain about the certificate.
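
Since set_config() changes httr's settings for the whole R session, a scoped variant may be preferable. The following is only a sketch: with_insecure_ssl is a hypothetical helper built on httr's with_config(), which applies a config while an expression is evaluated and restores the previous one afterwards.

```r
library(httr)

# Hypothetical helper: evaluate `expr` with certificate verification
# disabled, then restore the previous global httr configuration.
with_insecure_ssl <- function(expr) {
  with_config(config(ssl_verifypeer = 0L), expr)
}

# Usage (needs network access and library(rvest)):
# sessao <- with_insecure_ssl(html_session(url))

# The helper returns whatever the wrapped expression returns:
result <- with_insecure_ssl(1 + 1)
```

Requests made outside the helper, including later navigation from the session, verify certificates again. If set_config() was already used, httr's reset_config() restores the defaults.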

HTH,