My limited experience with web scraping in R has made me prefer the rvest package over XML for this kind of work. So I’ll give you a solution with it instead of one using the package you asked for:
library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"
# Download the page, parse every table on it, and keep the second one
tabela <- read_html(url) %>%
  html_table(fill=TRUE) %>%
  .[[2]]
The only trick here is knowing how to identify the position of the table that interests you within what was downloaded from the internet. For the address stored in the object url, the table we are interested in is at position [[2]].
As far as I know, the only way to find the right number is by trial and error. Maybe there’s another way, but I don’t know.
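One thing that may cut down the trial and error, as a minimal sketch: look at the dimensions and column names of every table that html_table() returns before choosing an index. The inline HTML below is a made-up stand-in for the downloaded page, and the "Population" column name is only an assumption for illustration; on the real page you would pass read_html(url) instead and check the names yourself.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: two small tables
page <- read_html('<html><body>
<table><tr><th>note</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</table>
</body></html>')

# One data frame per <table> element, in document order
tables <- html_table(page)

# Rows and columns of each table, one table per row of the result
dims <- t(sapply(tables, dim))
print(dims)

# Column names of each table, to spot the one you want
print(lapply(tables, names))

# Position of the first table that has the assumed "Population" column
idx <- which(sapply(tables, function(tb) "Population" %in% names(tb)))[1]
tabela <- tables[[idx]]
```

This does not remove the need to look at the results, but it usually narrows the choice to one or two candidates.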
If the code above produces the error Error in open.connection(x, "rb") : Timeout was reached, try running the commands below:
library(rvest)
library(curl)
url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"
# Open the connection through curl, sending a browser-like user agent
tabela <- read_html(curl(url, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_table(fill=TRUE) %>%
  .[[2]]
By using curl with a user agent, we make the scraper identify itself to the site. That way, the site does not refuse the connection that R tries to make.
When I use this command, this error appears: Error in open.connection(x, "rb") : Timeout was reached
– Danilo Imbimbo
Try the new code, below the horizontal line.
– Marcus Nunes
The error persists. Could this be related to my current version of R? I have 3.3.0, while the packages were built under 3.3.2.
– Danilo Imbimbo
Maybe. I tested both versions of the code on R 3.3.2, on Mac and Linux, and it worked on both. Update R and try again.
– Marcus Nunes
I tested it at home on version 3.3.0 and it worked fine. I believe it is something related to a firewall, but I do not know how to solve it.
– Danilo Imbimbo
Unfortunately, without being able to reproduce the exact conditions of the place where the code didn’t work, I can’t help you. So, given that my answer worked, could you accept it? That way, people who have the same question in the future will know that this answer is correct and addresses the question asked.
– Marcus Nunes
I didn’t know how to do this, but I’ve accepted it now. Thanks, friend!
– Danilo Imbimbo