I will answer only one part of the question: why does the error happen?
When you read an empty file with the read_html function from the xml2 package, using the code below:
library(xml2)
tf <- tempfile()
file.create(tf)
html_erro <- read_html(tf)
You get a list of two elements, each of type externalptr (an external pointer). This can be observed with:
str(html_erro)
List of 2
$ node:<externalptr>
$ doc :<externalptr>
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Now let’s look at each object in the list. First, $doc:
html_erro$doc
<pointer: 0x128c4d4c0>
Note that it is a pointer to the memory address 0x128c4d4c0.
Now look at the object $node:
html_erro$node
<pointer: 0x0>
It is a pointer to the address 0x0, that is, a null pointer. This is where the problem arises. When at some point your program tries to access the value behind this pointer, it tries to read a null/nonexistent memory address, causing what is called a segmentation fault.
In your case, the function html_nodes tried to access this address and hit the problem, but it could also happen, for example, when you run print(html_erro): the print method for xml_document tries to dereference this pointer and causes a segmentation fault.
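As a possible workaround until the underlying behavior is fixed, the empty file can be detected before it ever reaches read_html. The sketch below is only an illustration: safe_read_html is a hypothetical helper name (it is not part of xml2 or rvest), and it assumes that returning NULL for an empty or missing file is acceptable for your use case.

```r
# Hypothetical helper (not part of xml2 or rvest): skip files that are
# missing or have zero bytes, returning NULL instead of parsing them.
safe_read_html <- function(path) {
  size <- file.size(path)            # NA when the file does not exist
  if (is.na(size) || size == 0) {
    return(NULL)
  }
  xml2::read_html(path)              # only reached for non-empty files
}

# Same setup as above: an empty temporary file
tf <- tempfile()
file.create(tf)

doc <- safe_read_html(tf)
if (is.null(doc)) {
  message("Empty or missing file, nothing to parse")
}
```

With this guard the caller checks for NULL before calling html_nodes or print, so the object holding the null pointer is never created in the first place.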
I believe this is a bug in the xml2 package; note that the error already happens when you call xml2::read_html(tf). I think you should report it here: https://github.com/hadley/xml2/issues – Daniel Falbel
I have reported it on rvest, here. In the meantime, it doesn’t hurt to try asking a question on SO :P – Tomás Barcellos
I just don’t know if the problem is really in rvest; since it is just a wrapper around xml2, it is much more likely that the problem is in xml2. – Daniel Falbel