Hello, everyone. While web scraping, I started to come across errors that occur during the request step. So far I have identified four frequent error types:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Recv failure: Connection was reset
Error: Can only save length 1 node sets
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: www.tcm.ba.gov.br
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Operation timed out after 20000 milliseconds with 0 bytes received
The last one is triggered intentionally by httr::timeout(20), which aborts any request that takes longer than 20 seconds to complete, since a request that drags on that long is usually a sign that something went wrong.
The point is: how can I write a script/function in R (using something like tryCatch) to perform the following routine:
-> If an error occurs during the scraping process (such as one of the four above), wait 60 seconds and repeat the same request, up to a maximum of 3 attempts. If all 3 attempts fail, print("Consecutive errors during Web Scraping") and skip (next) to the next "i" of the loop, roughly as in the sketch just below.
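To make this concrete, here is a rough sketch of what I imagine, though I am not sure it is the right way to use tryCatch; the helper name get_with_retry and the vector links are just placeholders I made up:

library(httr)

# Placeholder helper: retry a GET up to max_attempts times, waiting
# `wait` seconds between failed attempts; return NULL if every attempt fails.
get_with_retry <- function(url, max_attempts = 3, wait = 60) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      httr::GET(url, httr::timeout(20)),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)         # success: stop retrying
    if (attempt < max_attempts) Sys.sleep(wait)  # wait before the next try
  }
  NULL                                           # all attempts failed
}

for (i in links) {                               # `links` is a placeholder vector of URLs
  response <- get_with_retry(i)
  if (is.null(response)) {
    print("Consecutive errors during Web Scraping")
    next                                         # skip to the next "i"
  }
  # ... parse/save the response here ...
}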
(Plus): -> It would be great if, instead of the print suggested in the last step above, the script sent an alert by e-mail or Telegram to the web scraping administrator (in this case, me), reporting that the script had to be stopped.
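For the alert, I have seen that there is a telegram.bot package on CRAN; something along these lines is what I have in mind, where the token and chat id are placeholders that would come from @BotFather and my own account (please correct me if this is not how the package is used):

library(telegram.bot)

# Placeholder credentials: the real token comes from @BotFather and the
# chat_id from my own Telegram account.
notify_admin <- function(text) {
  bot <- Bot(token = "YOUR_BOT_TOKEN")
  bot$sendMessage(chat_id = "YOUR_CHAT_ID", text = text)
}

# e.g. instead of print():
# notify_admin("Consecutive errors during Web Scraping - script stopped")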
OBS1: Consider that the request function (httr::GET) is within a for loop, as in the simplified example below:
for (i in link) { httr::GET(i, httr::timeout(20)) }
OBS2: Since I do not come from an IT background, I am having a hard time understanding how error handling works in R and, consequently, how to use the tryCatch function.
Thanks for your help.