I have a suggestion using the RSelenium and XML packages. RSelenium controls a web browser (in this case Firefox) and lets you drive it programmatically from R. This is very useful for pages with a lot of complex code and JavaScript, although it is not the simplest solution. Perhaps someone can post an example here using the rvest package.
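For comparison, here is a minimal rvest sketch of the same idea for a plain static page (the URL and selector are only placeholders, and rvest alone cannot execute JavaScript, which is exactly why I use RSelenium below):
library(rvest)
# read a static page and pull the address of every link on it (hypothetical page)
pagina_estatica <- read_html("http://example.com")
links <- html_attr(html_nodes(pagina_estatica, "a"), "href")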
...
Let's get to it:
Installing...
#install.packages("devtools")
#install.packages("RCurl",dep=T)
#install.packages("XML",dep=T)
#install.packages("RJSONIO",dep=T)
#library(devtools)
#install_github("ropensci/RSelenium")
Now we load the RSelenium package:
library(RSelenium)
Next we install a Java/Selenium server to control Firefox. It is a program that runs alongside R and works as a "translation" layer between R and the browser.
checkForServer() # downloads a Selenium server (only needs to be done once)
startServer()    # keep this window open
Make sure Mozilla Firefox is installed! Let's open it:
firefox_con <- remoteDriver(remoteServerAddr = "localhost",
                            port = 4444,
                            browserName = "firefox")
Opening Firefox (this is where the browsing will happen):
firefox_con$open() # keep this window open
# Defining the page of interest
url <- "http://www.sciencedirect.com"
We navigate to the page of interest in Firefox:
firefox_con$navigate("http://www.sciencedirect.com")
Then we insert the search term ("Biology") into the search box and press ENTER to run the search:
busca <- firefox_con$findElement(using = "css selector", "#qs_all")
busca$sendKeysToElement(list("Biology", key="enter"))
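One practical caveat (my addition, not part of the original recipe): depending on the connection, the results page may not have finished loading before we grab its source, so a short pause can help:
# wait a few seconds for the results page to load (adjust as needed)
Sys.sleep(5)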
From here on, the XML package does the rest:
# Extracting the page source
pagina <- xmlRoot(
  htmlParse(
    unlist(firefox_con$getPageSource())
  )
)
# Extracting the PDF links (some of them may require paid access...)
pdf_links <- xpathSApply(pagina, '//span[@class="pdfIconSmall"]/..', xmlGetAttr, "href")
links_incompletos <- grep("^/", pdf_links)
pdf_links[links_incompletos] <- paste0(url,pdf_links[links_incompletos])
# Your links
pdf_links
# links that work (free access)
pdf_gratis <- pdf_links[grep("article",pdf_links)]
# DOI (the DOI will be the name of the saved file)
DOI <- substr(pdf_gratis,50,66)
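The substr() call above assumes the identifier always occupies characters 50 to 66 of the URL. If the free links contain a "/pii/" segment (an assumption about how ScienceDirect builds its article URLs), a regular expression is less fragile:
# keep whatever comes after "/pii/" up to the next "/" or "?" (assumed URL layout)
DOI <- sub(".*/pii/([^/?]+).*", "\\1", pdf_gratis)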
# Downloading the files
### setwd... set a working directory...
for (i in seq_along(pdf_gratis)) {
  download.file(pdf_gratis[i],
                paste0(DOI[i], ".pdf"),
                mode = "wb")
}
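If any single download fails (a broken or paywalled link, for example), the loop above stops at that point; a hedged variant that simply skips failures could look like this:
for (i in seq_along(pdf_gratis)) {
  tryCatch(
    download.file(pdf_gratis[i], paste0(DOI[i], ".pdf"), mode = "wb"),
    error = function(e) message("Could not download: ", pdf_gratis[i])
  )
}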
I hope this helps.
I think extracting the PDFs violates that site's terms of use, doesn't it? Isn't the content paid?
– bfavaretto
Yes, but I have VPN access (permission to collect the articles remotely) through USP (I am a student).
– Karla Sessin Dilascio