Why does rvest break when processing an empty file?

When I try to process the contents of an empty file, the rvest package hangs and RStudio closes. Here is a small reproduction of the problem:

library(rvest)

tf <- tempfile()
file.create(tf)  # creates an empty (zero-byte) file
html_erro <- read_html(tf)
html_erro %>% html_nodes('h1') %>% html_text()

Why is this error (an empty file) handled this way? Why does R close instead of showing an error message?

Thank you!

  • I believe this is a bug in the xml2 package; see what happens when you run xml2::read_html(tf). I think you should report it here: https://github.com/hadley/xml2/issues

  • I have reported it on rvest, here. In the meantime, it doesn't hurt to ask on SO :P

  • I'm just not sure the problem is really in rvest. Since it is only a wrapper around xml2, the problem is much more likely to be in xml2.

1 answer

I will answer only this part: why does the error happen?

When you read an empty file with the read_html function from the xml2 package, using the code below:

library(xml2)

tf <- tempfile()
file.create(tf)  # creates an empty (zero-byte) file
html_erro <- read_html(tf)

You get a list of two elements, both of type externalptr. This can be seen with:

str(html_erro)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Now let’s look at each of these objects from the list. First the $doc:

html_erro$doc
<pointer: 0x128c4d4c0>

Note that it is a pointer to the memory address 0x128c4d4c0. Now look at the $node object:

html_erro$node
<pointer: 0x0> 

It is a pointer to the address 0x0, and this is where the problem arises. When your program at some point tries to access the value behind this pointer, it dereferences a null (invalid) memory address, which causes what is called a segmentation fault. A segmentation fault happens below the level of R's error handling, which is why the session is terminated instead of an R error message being shown.

In your case, the html_nodes function tried to access this address and hit the problem, but it could just as well happen when you call print(html_erro): the print method for xml_document tries to access this pointer and causes the segmentation fault.
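
Until this is handled upstream, one possible workaround is to check the file before handing it to read_html. The sketch below is only an illustration: safe_read_html is a hypothetical helper name (not part of xml2 or rvest), and it assumes that a missing or zero-byte file is the only case that needs guarding. Wrapping the original call in tryCatch would not help, because a segmentation fault is not an R condition.

```r
# Hypothetical helper (not part of xml2 or rvest): refuse to parse a file
# that is missing or has zero bytes, so that read_html never sees it.
safe_read_html <- function(path) {
  if (!file.exists(path)) {
    stop("File does not exist: ", path, call. = FALSE)
  }
  if (file.size(path) == 0) {
    stop("File is empty: ", path, call. = FALSE)
  }
  xml2::read_html(path)
}

tf <- tempfile()
file.create(tf)  # empty (zero-byte) file, as in the question

# The empty file now raises an ordinary R error that tryCatch can handle
result <- tryCatch(
  safe_read_html(tf),
  error = function(e) conditionMessage(e)
)
```

Here result holds the error message rather than a document, so the calling code can decide how to proceed (skip the file, log it, and so on) without the session being terminated.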

  • So the problem is not in R, but in the C++ code?

  • Not necessarily. But the problem is probably here: https://github.com/hadley/xml2/blob/18c8baa9f4f508e769efb5a01302a3e14a99895e/src/xml2_doc.cpp The doc_parse_raw function should perhaps handle the special case of an empty file.
