How to handle errors during web scraping?


Hello, everyone. While web scraping, I started to come across errors that occur during the request step. So far I have identified 4 frequent types of error:

    Error in curl::curl_fetch_memory(url, handle = handle) : 
      Timeout was reached: Recv failure: Connection was reset

    Error: Can only save length 1 node sets

    Error in curl::curl_fetch_memory(url, handle = handle) : 
      Could not resolve host: www.tcm.ba.gov.br

    Error in curl::curl_fetch_memory(url, handle = handle) : 
    Timeout was reached: Operation timed out after 20000 milliseconds with 0 bytes received

The last one is triggered intentionally by timeout(20), which aborts any request that takes more than 20 seconds to complete, since that may be a sign that something went wrong with the request.

The point is: how can I write a script/function in R (using something like tryCatch) that performs the following routine:

-> If an error occurs during scraping (such as one of the 4 above), wait 60 seconds and repeat the same request, up to a maximum of 3 attempts. If all 3 attempts fail, print("Consecutive errors during Web Scraping") and skip (next) to the next "i" of the loop.

(Bonus) -> It would be great if, instead of the print suggested above, the script sent a warning by e-mail or Telegram to the "web scraping administrator" (in this case, me), reporting that the script had to stop.

OBS1: Consider that the request function (httr::GET) is within a for loop, as in the simplified example below:

for (i in link) { httr::GET(i, timeout(20)) }

OBS2: As I don't come from an IT background, I'm having a hard time understanding how error handling works in R and, consequently, how to use the tryCatch function.

Thanks for your help.

1 answer

To solve this problem, and others like it, you need to handle errors in your script. There are basically two ways to do this:

1. tryCatch

2. purrr package


USING tryCatch

Here is an example from my own practice:

library(rvest)

for (i in 1:nrow(pg_amostra)) {

  texto <- tryCatch({
    # open a session on the page and extract the text
    pagina <- html_session(pg_amostra$http[i])
    Sys.sleep(0.3)
    html_nodes(pagina, css = '.corpoTextoLongo') %>%
      html_text()
  }, warning = function(w) {
    print('Warning!')
    Sys.sleep(0.3)
    pagina <- html_session(pg_amostra$http[i])
    html_nodes(pagina, css = '.corpoTextoLongo') %>%
      html_text()
  }, error = function(e) {
    print('Error')
    ''   # return an empty string when the request fails
  })

  print(i)

  assign(paste("texto", i, sep = ""), texto)
}

Here, the table pg_amostra holds information about several pages, and their URLs are in the column http.
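The same idea can be adapted to the httr::GET loop from the question to get the "up to 3 attempts, 60 seconds apart, then next" behaviour you described. Below is only a rough sketch of that logic (link is the vector of URLs, as in the question; the alert at the end is a placeholder comment, since sending e-mail or Telegram messages would need an extra package such as blastula or telegram.bot and its own setup):

library(httr)

for (i in link) {
  resposta <- NULL

  for (tentativa in 1:3) {
    resposta <- tryCatch(
      GET(i, timeout(20)),
      error = function(e) {
        message("Attempt ", tentativa, " failed for ", i, ": ", conditionMessage(e))
        NULL                               # signal failure with NULL
      }
    )
    if (!is.null(resposta)) break          # request succeeded, stop retrying
    if (tentativa < 3) Sys.sleep(60)       # wait 60 seconds before the next attempt
  }

  if (is.null(resposta)) {
    print("Consecutive errors during Web Scraping")
    # here you could send an alert (e-mail / Telegram) to the administrator
    next                                   # give up on this URL and move on
  }

  # ... process 'resposta' normally ...
}

Using NULL as the "failure" value keeps the check simple: as long as resposta is still NULL, every attempt so far has failed.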

SOLUTION WITH purrr

For error handling I almost always use purrr, because it is simply much easier. In this example, also from my own practice, I create a function that extracts a piece of text from a page.

library(rvest)

pega_texto <- function(url) {
  # read the page and extract the text of the .corpoTextoLongo nodes
  texto <- read_html(url, encoding = 'iso-8859-1') %>%
    html_nodes(css = '.corpoTextoLongo') %>%
    html_text()
  pb$tick()   # tick a progress bar created beforehand (e.g. with the progress package)
  return(texto)
}

However, if read_html() fails, the entire script can stop. So I wrap the function with purrr's possibly(), which handles any error by simply returning NA.

library(purrr)

pega_texto <- possibly(.f = pega_texto, otherwise = NA_character_, quiet = TRUE)

From now on, whenever I use this function and an error happens, it simply returns NA. So, when storing the results in a table, each row will contain either the text or NA, and I only need to redo the request for the URLs that returned NA.
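For example, a sketch of that last step (here I am assuming that pb, which pega_texto() ticks, is a progress bar created with the progress package, and that the URLs are in pg_amostra$http as before):

library(purrr)

# progress bar ticked by pega_texto() via pb$tick()
pb <- progress::progress_bar$new(total = nrow(pg_amostra))

# first pass: any failure becomes NA instead of stopping the script
pg_amostra$texto <- map(pg_amostra$http, pega_texto)

# second pass: retry only the URLs whose first attempt returned NA
falhas <- which(is.na(pg_amostra$texto))
if (length(falhas) > 0) {
  pb <- progress::progress_bar$new(total = length(falhas))
  pg_amostra$texto[falhas] <- map(pg_amostra$http[falhas], pega_texto)
}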
