Error when trying to extract a table from a website with R: how do I resolve it?

I’m using the code below to import the table of countries into R:

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"
country_data <- readHTMLTable(url, which=2)

R returns the error:

Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_countries_by_population"

How should I proceed?

1 answer

In my limited experience with web scraping in R, I have come to prefer the rvest package over XML for this kind of work. So I’ll give you a solution with rvest instead of the package you were using:

library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Download the page, convert every <table> into a data frame, and keep the second one
tabela <- read_html(url) %>%
  html_table(fill = TRUE) %>%
  .[[2]]

The only trick here is identifying the position of the table you want among all the tables downloaded from the page. html_table() returns a list with one element per table, and for the address stored in the object url, the table we are interested in is at position [[2]].

As far as I know, the only way to find the right number is by trial and error. Maybe there’s another way, but I don’t know of one.
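One way to shorten the trial and error is to print a summary of every table on the page and pick the position from that. The sketch below is only an illustration. It parses a small inline HTML string (a hypothetical stand-in for the downloaded page, so it runs without network access) and assumes the table of interest has a column named "Country".

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: two tables,
# of which only the second is the one we want.
page <- read_html('
<html><body>
  <table><tr><th>Note</th></tr><tr><td>Navigation box</td></tr></table>
  <table>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>200</td></tr>
  </table>
</body></html>')

# One data frame per <table> element, in document order
tables <- html_table(page)

# Print the position, dimensions, and column names of each table
for (i in seq_along(tables)) {
  cat(i, ":", paste(dim(tables[[i]]), collapse = " x "), "|",
      paste(names(tables[[i]]), collapse = ", "), "\n")
}

# Position of the first table that has a column named "Country"
# (the column name is an assumption made for this example)
idx <- which(vapply(tables, function(tb) "Country" %in% names(tb), logical(1)))[1]
tabela <- tables[[idx]]
```

On a real page, replacing the inline string with read_html(url) should give the same kind of summary, though the column names to look for depend on the page.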


If the code above fails with the error Error in open.connection(x, "rb") : Timeout was reached, try running the code below:

library(rvest)
library(curl)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

tabela <- read_html(curl(url, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_table(fill=TRUE) %>%
  .[[2]]

By using curl with a user agent, we make the scraper identify itself to the site. That way, the site does not refuse the connection that R tries to make.
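The same idea can also be sketched with the httr package, which is not used in the answer above and is offered here only as an alternative. The helper name fetch_table, its arguments, and the 30-second timeout are all assumptions made for illustration.

```r
library(httr)
library(rvest)

# Hypothetical helper (not part of the original answer): fetch a page with an
# explicit user agent and timeout, then return the table at position `index`.
fetch_table <- function(url, index, agent = "Mozilla/5.0", seconds = 30) {
  resp <- GET(url, user_agent(agent), timeout(seconds))
  stop_for_status(resp)  # turn HTTP errors (403, 404, ...) into R errors

  page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
  tables <- html_table(page)

  if (index > length(tables)) {
    stop("The page has only ", length(tables), " table(s)")
  }
  tables[[index]]
}

# Usage (needs network access):
# tabela <- fetch_table("http://en.wikipedia.org/wiki/List_of_countries_by_population", 2)
```

An explicit timeout makes a blocked connection fail quickly with a clear message. It will not get around a firewall that blocks R altogether.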

  • When I use this command, this error appears: Error in open.connection(x, "rb") : Timeout was reached

  • Try the new code, below the horizontal line.

  • The error persists. Could this be related to my current version of R? I have 3.3.0, and the packages were built under 3.3.2.

  • Maybe. I tested both code snippets on R 3.3.2, on Mac and Linux, and they worked on both. Update R and try again.

  • I tested it at home on version 3.3.0 and it worked fine. I believe it is something related to a firewall, but I do not know how to solve it.

  • Unfortunately, without being able to reproduce the exact conditions of the place where the code didn’t work, I can’t help any further. Since my answer did work, could you accept it? That way, people who have the same question in the future will know that this answer is correct and solves the problem.

  • I didn’t know how to do that, but I’ve accepted it now. Thanks, friend!
