Why does rvest break when processing an empty file?




When I try to process the contents of an empty file, the rvest package hangs and RStudio closes. A small reproduction of the problem follows:

library(rvest)
tf <- tempfile()  # returns a path only; no file is created on disk
html_erro <- read_html(tf)
html_erro %>% html_nodes('h1') %>% html_text()

Why is this error (a nonexistent file) handled this way? Why does R close instead of showing an error message?

Thank you!

  • 1

    I believe this is a bug in the xml2 package: the same failure happens when you call xml2::read_html(tf). I think you should report it here: https://github.com/hadley/xml2/issues

  • I have reported it on rvest, here. In the meantime, it doesn't hurt to ask the question on SO :P

  • 1

    I just don't know whether the problem is really in rvest. Since it is just a wrapper around xml2, it is much more likely that the problem is in xml2.

1 answer


I will answer only one part of the question: why does the error happen?

When you read an empty file with the read_html function from the xml2 package, using the code below:

tf <- tempfile()
html_erro <- read_html(tf)

You get a list of two elements, both of class externalptr. This can be observed with str(html_erro):

List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Now let's look at each of the objects in this list. First, $doc:

<pointer: 0x128c4d4c0>

Note that it is a pointer to the memory address 0x128c4d4c0. Now look at $node:

<pointer: 0x0> 

It is a pointer to the address 0x0, and this is where the problem occurs. When your program at some point attempts to access the value behind this pointer, it tries to read a null (nonexistent) memory address, which causes what is called a segmentation fault.

In your case, the function html_nodes tried to access this address and hit the problem, but it could just as well happen when you call print(html_erro): the print method for xml_document tries to access this pointer and causes a segmentation fault.
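To make the null-pointer idea concrete, here is a minimal sketch of checking from R whether an external pointer is null before anything dereferences it. The helper is_null_extptr is hypothetical (it is not part of xml2 or rvest), and the sketch assumes that methods::new("externalptr") yields a null pointer and that identical() compares external pointers by address, which is its default behaviour.

```r
# Hypothetical helper, not part of xml2 or rvest: returns TRUE when `p` is
# an external pointer whose address is null.
is_null_extptr <- function(p) {
  # new("externalptr") is assumed to create a fresh null external pointer;
  # identical() compares external pointers by the address they hold.
  typeof(p) == "externalptr" && identical(p, methods::new("externalptr"))
}

# A stand-in for the $node element described above
null_ptr <- methods::new("externalptr")
is_null_extptr(null_ptr)

# Anything that is not an external pointer is simply reported as FALSE
is_null_extptr(list())
```

With the affected xml2 version, the same check could in principle be applied to html_erro$node before calling html_nodes, since reading the list element does not dereference the pointer.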
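Until the parser itself rejects such input, one possible workaround is to validate the file before handing it to read_html, so that R raises an ordinary error instead of crashing. The sketch below is only an illustration: safe_read_html and its reader argument are hypothetical names, not part of xml2 or rvest.

```r
# Hypothetical wrapper: refuses missing or empty files before parsing.
# `reader` defaults to xml2::read_html and is only evaluated once the
# checks have passed.
safe_read_html <- function(path, reader = xml2::read_html) {
  if (!file.exists(path)) {
    stop("File does not exist: ", path, call. = FALSE)
  }
  if (file.size(path) == 0) {
    stop("File is empty: ", path, call. = FALSE)
  }
  reader(path)
}

# tempfile() only returns a path, so the first check fails here
tf <- tempfile()
msg_missing <- tryCatch(safe_read_html(tf), error = function(e) conditionMessage(e))

# After creating the (empty) file, the second check fails instead
file.create(tf)
msg_empty <- tryCatch(safe_read_html(tf), error = function(e) conditionMessage(e))
```

Both calls end in a regular R error that tryCatch can handle, so the session stays alive.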

  • So the problem is not in R, but in the C++ code?

  • Not necessarily. But it is probably here: https://github.com/hadley/xml2/blob/18c8baa9f4f508e769efb5a01302a3e14a99895e/src/xml2_doc.cpp The function doc_parse_raw should perhaps handle the empty-file case as an exception.
