I will answer only one part of the question: why does the error happen?
When you read an empty file with the function read_html from the xml2 package, using the code below:
library(xml2)
tf <- tempfile()
file.create(tf)  # creates an empty file
html_erro <- read_html(tf)
You get a list of two elements, each of type externalptr. This can be observed with:
str(html_erro)
List of 2
$ node:<externalptr>
$ doc :<externalptr>
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Now let’s look at each of the objects in this list. First, $doc:
html_erro$doc
<pointer: 0x128c4d4c0>
Note that it is a pointer to the memory address 0x128c4d4c0.
Now look at the object $node:
html_erro$node
<pointer: 0x0>
It is a pointer to the address 0x0, and this is where the problem arises. When your program at some point tries to access the value behind this pointer, it tries to read a null (nonexistent) memory address, which causes what is called a segmentation fault.
In your case, the function html_nodes tried to access this address and hit the problem, but it could also happen when you call print(html_erro), for example: the print method for xml_document tries to access this pointer and causes a segmentation fault.
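Since the null pointer comes from parsing a file with no content, one way to avoid reaching it is to check the file size before parsing. The following is only a minimal sketch under that assumption: safe_read_html is a hypothetical helper name, not part of xml2 or rvest, and its reader argument exists only so that the parser can be swapped out.

```r
# Hypothetical helper (not part of xml2 or rvest): skip parsing when the
# file is missing or empty, so a document with a null node is never created.
safe_read_html <- function(path, reader = xml2::read_html) {
  size <- file.info(path)$size  # NA when the file does not exist
  if (is.na(size) || size == 0) {
    return(NULL)
  }
  reader(path)
}

tf <- tempfile()
file.create(tf)            # empty file, as in the example above
doc <- safe_read_html(tf)  # the reader is never called for an empty file
is.null(doc)
```

The caller then has to test for NULL before passing the result on to html_nodes or print.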
I believe this is a bug in the xml2 package; note that the error also happens when you call xml2::read_html(tf). I think you should report it here: https://github.com/hadley/xml2/issues – Daniel Falbel
I have reported it on rvest, here. Until then, it doesn’t hurt to try asking on SO :P – Tomás Barcellos
I just don’t know if the problem is really in rvest; since it is just a wrapper around xml2, it is much more likely that the problem is in xml2. – Daniel Falbel