Web scraping to collect scientific papers from ScienceDirect

I am trying to use R to select articles from the ScienceDirect site by keyword. Last week I was able to extract PDFs from another page by reading the links in its page source. This is the code I used:

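    library(XML)  # provides htmlParse() and xpathSApply()

    base.url = "http"
    doc.html <- htmlParse(base.url)
    # Collect every href on the page and keep the absolute links
    doc.links <- xpathSApply(doc.html, "//a/@href")
    pdf.url <- doc.links[grep("http:/", doc.links)]
    dat <- as.data.frame(pdf.url)
    colnames(dat) <- "url"
    # Use the third "/"-separated piece of each URL as the file name
    dat$pdf <- unlist(lapply(dat$url, FUN = function(x) strsplit(x, "/")[[1]][3]))
    # download.folder must already hold the path of the target directory
    lapply(dat$pdf, function(x)
      download.file(paste("http//pdf/", x, sep = ""),
                    paste(download.folder, x, sep = ""), mode = "wb", cacheOK = TRUE))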

Does anyone have suggestions on how I can do the same for ScienceDirect?

  • I think extracting the PDFs violates that site's terms of use, right? Isn't the content paid?

  • Yes, but I have VPN access (permission to collect the articles remotely) through USP (I am a student).

1 answer

I have a suggestion using the RSelenium and XML packages. RSelenium controls a web browser (in this case Firefox) and lets you drive it automatically from the command line. This is a big advantage on pages with a lot of complex code and JavaScript. It is not the easiest solution, however. I believe someone could post an example here using the rvest package...

Let's go, then:

Installing...

    #install.packages("devtools")
    #install.packages("RCurl",dep=T)
    #install.packages("XML",dep=T)
    #install.packages("RJSONIO",dep=T)

    #library(devtools)
    #install_github("ropensci/RSelenium")

Now we load the RSelenium package:

    library(RSelenium)

Next we install and start a Java/Selenium server to control Firefox. It is a program that runs alongside R and acts as a "translation" layer between R and the browser.

    checkForServer() # download a Selenium server (only needs to be done once)
    startServer() # keep this window open
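A side note on the server step: in more recent RSelenium releases, `checkForServer()` and `startServer()` are listed as defunct, with `rsDriver()` suggested in their place. The sketch below is only an assumption about how this step could look there, not part of the original answer. The helper name `start_firefox_session` and the port number are made up for illustration, and it still needs Firefox and Java installed.

```r
# Hypothetical helper: start a Selenium server plus a Firefox client
# with rsDriver() instead of checkForServer()/startServer().
start_firefox_session <- function(port = 4567L) {
  if (!requireNamespace("RSelenium", quietly = TRUE)) {
    stop("Please install the RSelenium package first.")
  }
  # rsDriver() fetches the drivers it needs and launches the browser
  driver <- RSelenium::rsDriver(browser = "firefox", port = port)
  # Return both the server handle and the remoteDriver client
  list(server = driver$server, client = driver$client)
}
```

If this route works on your machine, `session <- start_firefox_session()` followed by `firefox_con <- session$client` should give a client that is already connected to an open browser, so the `remoteDriver()` and `open()` calls would not be needed.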

Make sure Mozilla Firefox is installed. Then we set up the connection to it:

    firefox_con <- remoteDriver(remoteServerAddr = "localhost", 
                                port = 4444, 
                                browserName = "firefox"
    )

Now we open Firefox (all the browsing will happen in it):

    firefox_con$open() # keep this window open

    # Set the page of interest
    url <- "http://www.sciencedirect.com"

We navigate to the page of interest in Firefox:

    firefox_con$navigate("http://www.sciencedirect.com")

Then we type the search term ("Biology") into the search box and press ENTER to run the search:

    busca <- firefox_con$findElement(using = "css selector", "#qs_all")
    busca$sendKeysToElement(list("Biology", key="enter"))
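The results page may take a moment to load after ENTER is sent, so reading the page source immediately could return an incomplete page. A minimal sketch of a polling helper in base R follows; the name `wait_until` and its defaults are illustrative assumptions, not part of RSelenium.

```r
# Hypothetical helper: call `condition` repeatedly until it returns TRUE
# or until `timeout` seconds have passed. Returns TRUE on success and
# FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}
```

With the session above it could be used as `wait_until(function() length(firefox_con$findElements(using = "css selector", ".pdfIconSmall")) > 0)`, assuming the results page still marks PDF links with the `pdfIconSmall` class.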

The rest is done with the XML package:

    library(XML)  # provides htmlParse(), xmlRoot(), xpathSApply()

    # Extract the page source
    pagina <- xmlRoot(
                    htmlParse(
                            unlist(firefox_con$getPageSource())
                    )) 

    # Extract the links to the PDFs (some of them may require paid access...)
    pdf_links <- xpathSApply(pagina, '//span[@class="pdfIconSmall"]/..', xmlGetAttr, "href")
    links_incompletos <- grep("^/", pdf_links)
    pdf_links[links_incompletos] <- paste0(url,pdf_links[links_incompletos])

    # Your links
    pdf_links

    # Links that work (free)
    pdf_gratis <- pdf_links[grep("article",pdf_links)]

    # DOI (the DOI will be the name of the saved file)
    DOI <- substr(pdf_gratis,50,66)

    # Download the files
    ### setwd... set a directory...

    # seq_along() skips the loop entirely when no free links were found
    for(i in seq_along(pdf_gratis)){
            download.file(pdf_gratis[i], 
                          paste0(DOI[i],".pdf"),
                          mode = "wb")
    }
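One caveat about the file names: `substr(pdf_gratis, 50, 66)` depends on the exact layout of the URLs. If a link is shorter than 50 characters the result is an empty string, so every download would be written to the same `.pdf` file. The sketch below shows one possible alternative that derives a unique name from the end of each URL. The helper name `make_pdf_names` and the example URLs are hypothetical.

```r
# Hypothetical helper: build one unique file name per URL.
make_pdf_names <- function(urls) {
  # Drop any query string or fragment, then keep the last path segment
  stem <- basename(sub("[?#].*$", "", urls))
  # Replace characters that are awkward in file names
  stem <- gsub("[^A-Za-z0-9._-]", "_", stem)
  # Remove an existing .pdf extension so it is not doubled
  stem <- sub("\\.pdf$", "", stem, ignore.case = TRUE)
  # Fall back to a generic stem when nothing usable is left
  stem[!nzchar(stem)] <- "article"
  # make.unique() appends _1, _2, ... to repeated stems
  paste0(make.unique(stem, sep = "_"), ".pdf")
}

# Made-up URLs, for illustration only
example_urls <- c("http://example.org/x/S001.pdf?download=true",
                  "http://example.org/y/S001.pdf",
                  "http://example.org/z/pdf")
make_pdf_names(example_urls)
```

In the loop, `paste0(DOI[i], ".pdf")` could then be swapped for `make_pdf_names(pdf_gratis)[i]`, at the cost of no longer naming the files after the DOI.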

I hope this helps.

  • Thank you very much for your answer! The steps went really well. I just haven't been able to save the PDF files on my end.

  • @Karla, with the loop above I can download and save the PDFs normally... weird. Did this part of the code not work for you? Was there an error? If so, what message is returned?

  • Rogériojb, it saves one PDF file that should contain all the downloaded PDFs, but the file is blank.

  • So there is actually no error message in R; the file is just saved blank.
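
Since `download.file()` can finish without an error and still leave an empty or non-PDF file (for example, a login page saved under a `.pdf` name), one way to narrow down the blank-file symptom is to inspect what was actually written. This is a minimal sketch in base R, not a diagnosis of the case above; the helper name `looks_like_pdf` is hypothetical.

```r
# Hypothetical helper: TRUE when the file exists and starts with the
# "%PDF-" signature that PDF files begin with.
looks_like_pdf <- function(path) {
  if (!file.exists(path) || file.size(path) < 5) return(FALSE)
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 5)
  identical(header, charToRaw("%PDF-"))
}
```

After the download loop, `vapply(list.files(pattern = "\\.pdf$"), looks_like_pdf, logical(1))` would flag which saved files are not real PDFs, and `file.size()` on the same files would show whether they are empty.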
