Web scraping on application websites in R

I am looking for ideas on how to get the data from this link: http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1

I used the code below to try to extract the data, but when I pass the CSS selector for what I want to extract, it returns an empty set of HTML nodes.

rm(list = ls())
#install.packages("rvest")
library(rvest)

url <- "http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1"

#Reading the HTML code from the website
webpage <- read_html(url)

#Using a CSS selector to scrape the section of interest
rank_data_html <- html_nodes(webpage,'.cv-nwl')

I would like to emphasize that the site is not in the standard format we usually see when extracting data from the internet, i.e. plain HTML. It looks more like an application. Does anyone have an idea of how to extract this data?

  • Does this solve it? http://answall.com/q/109475/3635

  • This data seems to be inside an Excel spreadsheet. Isn't it possible to download the spreadsheet directly?

  • It is possible; the problem is that there is a form at the top where you specify what you want.

  • Please explain what you tried and the problems you encountered.

  • Does this help, at least as a starting point: http://answall.com/q/109475/3635? I am voting to close as too broad because there is not much to go on and no code or attempt has been presented. It's not necessarily a duplicate.

  • Wouldn't it be possible to download the .xlsx file and parse it?

  • In fact, the point is that I would like to get the results of each month for each year. So I would have to download several .xlsx files, since each selection generates a new file.

  • @GuilhermeNascimento the solution at pt.stackoverflow.com/q/109475/3635 does not help, because the page I would like to scrape behaves like an Excel application. I would have to fill in the form to extract the values. Do you have another suggestion?

  • Wow, a year ago :) ... So if what you want is to read the rows and columns of a spreadsheet, this package should solve your problem: https://github.com/tidyverse/readxl

  • No; at the link I sent earlier, the Excel file actually opens as if it were a web page.

  • @morebru that page is populated dynamically via Ajax, so there is nothing to be done with it directly unless you use something like PhantomJS, which is a lot of work for something that can be solved another way. If you choose File and then Download, it generates this link: https://pastebin.com/raw/XFAUYbtZ (the actual link is inside that paste, since it does not fit in a comment)
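
The download-and-parse route suggested in the comments could be sketched roughly as follows. This is only a sketch under assumptions: the helper names `direct_xlsx_url` and `fetch_workbook` are hypothetical, and it assumes the workbook can be fetched from the server-relative path given in the viewer's `id` parameter, which the thread does not confirm (the real download link from the comment above may differ). It also does not reproduce the month and year selections made in the form.

```r
# Hypothetical helper: guess the direct workbook URL from an xlviewer.aspx URL
# by taking the host and appending the path found in the `id` query parameter.
direct_xlsx_url <- function(viewer_url) {
  # Capture the scheme and host, skip the viewer path, capture the query string
  parts <- regmatches(
    viewer_url,
    regexec("^(https?://[^/]+)[^?]*\\?(.*)$", viewer_url)
  )[[1]]
  if (length(parts) != 3) stop("Not a viewer URL with a query string")
  host  <- parts[2]
  query <- strsplit(parts[3], "&", fixed = TRUE)[[1]]
  # Keep the value of the `id` parameter (server-relative workbook path)
  id <- sub("^id=", "", query[grepl("^id=", query)])
  if (length(id) != 1) stop("Expected exactly one `id` parameter")
  paste0(host, utils::URLdecode(id))
}

# Hypothetical helper: download the workbook and read its first sheet.
# Requires the readxl package (install.packages("readxl")).
fetch_workbook <- function(xlsx_url, dest = tempfile(fileext = ".xlsx")) {
  # mode = "wb" keeps the binary file intact on Windows
  download.file(xlsx_url, destfile = dest, mode = "wb")
  readxl::read_excel(dest)
}

viewer_url <- "http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1"
xlsx_url <- direct_xlsx_url(viewer_url)

# Not run: needs network access and assumes the server serves the file there
# dados <- fetch_workbook(xlsx_url)
```

If the guessed URL is not served by the site, `fetch_workbook` can be pointed at the link produced by the viewer's download option instead; the parsing step with `readxl::read_excel` stays the same.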
