I have a suggestion using the RSelenium and XML packages. RSelenium controls a web browser (in this case Firefox) and lets you drive it programmatically from R. This is very useful for pages with a lot of complex code and JavaScript, although it is not the simplest solution. Perhaps someone can post an example here using the rvest package.
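For comparison, here is a minimal rvest sketch of the same idea for a plain static page (the URL and selector are only placeholders, and rvest alone cannot execute JavaScript, which is exactly why I use RSelenium below):
library(rvest)
# read a static page and pull the address of every link on it (hypothetical page)
pagina_estatica <- read_html("http://example.com")
links <- html_attr(html_nodes(pagina_estatica, "a"), "href")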
...
Let's get to it:
Installing...
#install.packages("devtools")
#install.packages("RCurl",dep=T)
#install.packages("XML",dep=T)
#install.packages("RJSONIO",dep=T)
#library(devtools)
#install_github("ropensci/RSelenium")
Now we load the RSelenium package:
library(RSelenium)
Next we install a Java/Selenium server to control Firefox. It is a program that runs alongside R and works as a "translation" layer between R and the browser.
checkForServer() # downloads a Selenium server (only needs to be done once)
startServer()    # keep this window open
Make sure Mozilla Firefox is installed! Let's open it:
firefox_con <- remoteDriver(remoteServerAddr = "localhost",
                            port = 4444,
                            browserName = "firefox")
Opening Firefox (this is where the browsing will happen):
firefox_con$open() # keep this window open
# Defining the page of interest
url <- "http://www.sciencedirect.com"
We navigate to the page of interest in Firefox:
firefox_con$navigate("http://www.sciencedirect.com")
Then we insert the search term ("Biology") into the search box and press ENTER to run the search:
busca <- firefox_con$findElement(using = "css selector", "#qs_all")
busca$sendKeysToElement(list("Biology", key="enter"))
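One practical caveat (my addition, not part of the original recipe): depending on the connection, the results page may not have finished loading before we grab its source, so a short pause can help:
# wait a few seconds for the results page to load (adjust as needed)
Sys.sleep(5)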
From here on, the XML package does the rest:
# Extracting the page source
pagina <- xmlRoot(
  htmlParse(
    unlist(firefox_con$getPageSource())
  )
)
# Extracting the PDF links (some of them may require paid access...)
pdf_links <- xpathSApply(pagina, '//span[@class="pdfIconSmall"]/..', xmlGetAttr, "href")
links_incompletos <- grep("^/", pdf_links)
pdf_links[links_incompletos] <- paste0(url,pdf_links[links_incompletos])
# Your links
pdf_links
# links that work (free access)
pdf_gratis <- pdf_links[grep("article",pdf_links)]
# DOI (the DOI will be the name of the saved file)
DOI <- substr(pdf_gratis,50,66)
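The substr() call above assumes the identifier always occupies characters 50 to 66 of the URL. If the free links contain a "/pii/" segment (an assumption about how ScienceDirect builds its article URLs), a regular expression is less fragile:
# keep whatever comes after "/pii/" up to the next "/" or "?" (assumed URL layout)
DOI <- sub(".*/pii/([^/?]+).*", "\\1", pdf_gratis)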
# Downloading the files
### setwd... set a working directory...
for (i in seq_along(pdf_gratis)) {
  download.file(pdf_gratis[i],
                paste0(DOI[i], ".pdf"),
                mode = "wb")
}
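If any single download fails (a broken or paywalled link, for example), the loop above stops at that point; a hedged variant that simply skips failures could look like this:
for (i in seq_along(pdf_gratis)) {
  tryCatch(
    download.file(pdf_gratis[i], paste0(DOI[i], ".pdf"), mode = "wb"),
    error = function(e) message("Could not download: ", pdf_gratis[i])
  )
}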
I hope this helps.
I think extracting the PDFs violates that site's terms of use, doesn't it? Isn't the content paid?
– bfavaretto
Yes, but I have VPN access (permission to collect the articles remotely) through USP (I am a student).
– Karla Sessin Dilascio