I am trying to scrape the Web of Science website, but I am having trouble scraping links from it.
My intention is to scrape the article titles and the links that lead to each article's page within the Web of Science, so that I can then scrape other data such as the abstract and keywords. Finally, I want to loop through the result pages and collect this information up to the last page of the search.
I started with the following code:
library(rvest)
library(dplyr)

# Search results page (session-specific URL, requires an active session/SID)
link <- paste0("https://apps.webofknowledge.com/Search.do?",
               "product=WOS&SID=5Bzr6AeuFKanEWXAFWh&search_mode=GeneralSearch&",
               "prID=45752f34-12a3-474b-8fcf-6b21a2196ed7")

page <- read_html(link)

# Article titles
titulo_artigo <- page %>%
  html_nodes(".snowplow-full-record value") %>%
  html_text()

# Links to each article's full record
links_dos_artigos <- page %>%
  html_nodes(".snowplow-full-record value") %>%
  html_attr("href")
However, links_dos_artigos returns only NA values, not the links I need.
I’d appreciate it if someone could help.
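Note that html_attr("href") returns NA for every matched node that lacks an href attribute, which suggests the selector is picking up the inner <value> elements holding the title text rather than the enclosing <a> tags. Below is a minimal sketch of that idea; it assumes the links live on <a class="snowplow-full-record"> elements and that the full-record URLs come back relative, neither of which can be verified here because the page sits behind a login.

# Assumed fix: select the <a> elements themselves, which carry the href,
# instead of their <value> children, which only hold the title text.
titulo_artigo <- page %>%
  html_nodes("a.snowplow-full-record value") %>%
  html_text()

links_dos_artigos <- page %>%
  html_nodes("a.snowplow-full-record") %>%
  html_attr("href")

# If the hrefs are relative (e.g. "/full_record.do?..."), prepend the host:
links_completos <- paste0("https://apps.webofknowledge.com", links_dos_artigos)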
Welcome to Stack Overflow in Portuguese! Dheynne, the link you posted is for a page behind a login, so you need to pass the session parameters to access it. I don't know if it still works, but have you tried the wosr package? – Daniel Ikenaga
I haven't tested the wosr package yet, so I'll take a look at it. As for the link, it really does have restricted access; here I can reach it through the network of my Federal Educational Institution (free access). I will take a look at the package you mentioned. Thank you very much, Daniel Ikenaga!
– DHEYNNE ALVES
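Since the comments point to wosr, here is a minimal sketch of what that route could look like. It assumes the package's auth(), query_wos() and pull_wos() functions and IP-based institutional access; the query string and the fields inspected at the end are illustrative only.

library(wosr)

# With IP-based institutional access, username/password can stay NULL.
sid <- auth(username = NULL, password = NULL)

# Illustrative query in Web of Science advanced-search syntax.
query <- 'TS = ("web scraping")'

# Check how many records match before downloading.
query_wos(query, sid = sid)

# pull_wos() returns the records as a list of data frames
# (publication, author, keyword, ...), so no HTML scraping is needed.
dados <- pull_wos(query, sid = sid)

head(dados$publication)  # titles, abstracts and other record-level fields
head(dados$keyword)      # author keywords per record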