How to do web scraping on Google Scholar?

Using Mozilla Firefox, could anyone tell me how to do web scraping on Google Scholar? Where should I start?

  • Karla, hello! Thank you for sharing. May I suggest a change? It might be worth switching to a question/answer format: that helps anyone whose question ("How to do web scraping on Google Scholar?") is answered by this content, and it also fits Stack Overflow's format.

  • The site uses a question/answer format, preferably with some working code. In your case, post an answer to your own question.

  • You should post your own answer, and then I will delete my "answer". I wrote it only to show how it should look.

1 answer


I will publish here my web scraping code for Google Scholar, driven by keywords. With this code I was able to obtain the title, authors, abstract, and number of citations from the Google Scholar results pages. The code is based on work by the following R programmers: Kay Cichini, Gabor Pozsgai, and Rogério Barbosa.

library(RSelenium)
library(XML)   # provides htmlParse / xpathSApply used below
library(xlsx)

checkForServer() # downloads the Selenium server (only needs to be done once)
startServer()    # keep this window open

Make sure Mozilla Firefox is installed! Let's open it:

firefox_con <- remoteDriver(remoteServerAddr = "localhost", 
                            port = 4444, 
                            browserName = "firefox"
)

Opening Firefox (browsing will happen in it):

firefox_con$open() # keep this window open

Performing the scraping:

url <- paste("http://scholar.google.com/scholar?q=", "+key+word", "&num=1&as_sdt=1&as_vis=1", 
             sep = "")

firefox_con$navigate("http://scholar.google.com")
busca <- firefox_con$findElement(using = "css selector", value = "#gs_hp_tsi")
Keyword <- busca$sendKeysToElement(list("key word", key="enter"))

pages.max <- 10

scraper_internal <- function(x) {
  # extract title, authors, abstract and "cited by" info from an already-parsed page
  tit <- xpathSApply(x, "//h3[@class='gs_rt']", xmlValue)
  aut <- xpathSApply(x, "//div[@class='gs_a']", xmlValue)
  abst <- xpathSApply(x, "//div[@class='gs_rs']", xmlValue)
  others <- xpathSApply(x, "//div[@class='gs_fl']", xmlValue)
  data.frame(TITLE = tit, AUTHORS = aut, ABSTRACT = abst, CITED = others)
}
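To see what the XPath queries in scraper_internal extract, here is a minimal offline sketch run against a hand-written HTML snippet that mimics Google Scholar's result markup. The class names (gs_rt, gs_a, gs_rs, gs_fl) come from the code above; the snippet content itself is invented for illustration:

```r
library(XML)

# Tiny hand-made snippet imitating one Google Scholar result (illustrative only)
html <- '<html><body>
  <h3 class="gs_rt">Example Paper Title</h3>
  <div class="gs_a">A. Author, B. Author - Journal, 2015</div>
  <div class="gs_rs">A short abstract snippet...</div>
  <div class="gs_fl">Cited by 42</div>
</body></html>'

doc <- htmlParse(html, encoding = "UTF-8")
tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
aut <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)
tit  # "Example Paper Title"
aut  # "A. Author, B. Author - Journal, 2015"
```

On a real results page each query returns one value per result, so the four vectors line up row by row in the data frame.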

for (i in seq(0, (pages.max - 1) * 10, 10)) { # Scholar's "start" is 0-based: 0, 10, 20, ...
  baseURL <- paste("http://scholar.google.com/scholar?start=", i, "&q=", "+key+word",
                   "&hl=en&lr=lang_en&num=10&as_sdt=1&as_vis=1",
                   sep = "")
  firefox_con$navigate(baseURL)
  pagina <- xmlRoot(htmlParse(
    unlist(firefox_con$getPageSource())
  ))
  result <- scraper_internal(pagina)
  write.xlsx(result, "C:/KEYWORD.xlsx",
             sheetName = paste("keyword", i), row.names = TRUE, col.names = TRUE,
             append = TRUE)
}
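Note that Google Scholar paginates with the start parameter counting results from 0, so page 1 is start=0, page 2 is start=10, and so on. A minimal sketch of the URL construction, using only base R (the "+key+word" placeholder is kept from the code above; substitute your own query):

```r
# Build the page URLs for the first pages.max result pages
pages.max <- 3
starts <- seq(0, (pages.max - 1) * 10, 10)  # 0, 10, 20
urls <- paste("http://scholar.google.com/scholar?start=", starts,
              "&q=", "+key+word",
              "&hl=en&num=10&as_sdt=1&as_vis=1",
              sep = "")
urls[1]  # "http://scholar.google.com/scholar?start=0&q=+key+word&hl=en&num=10&as_sdt=1&as_vis=1"
```

Keep in mind that Google Scholar actively blocks automated requests, so it is wise to space the page loads out (for example with Sys.sleep between iterations of the loop).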
  • Karla, I moved the content that was in Tony's answer to its proper place here.