Web scraping on application websites in R

Asked 8 years, 5 months ago

I would like some ideas on how to get the data from this link: http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1

I used the code below to try to extract the data, but when I pass the CSS selector for what I want to extract, it returns an empty set of HTML nodes.

rm(list = ls())
# install.packages("rvest")
library(rvest)

# URL of the Excel viewer page
url <- "http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1"

# Read the HTML returned by the server
webpage <- read_html(url)

# Use a CSS selector to pick the nodes I want (this returns an empty node set)
rank_data_html <- html_nodes(webpage, ".cv-nwl")

I would like to emphasize that the site is not in the standard format we usually deal with when extracting data from the internet, i.e., HTML. It looks more like an application. Does anyone have an idea of how to extract this data?
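
One way to see why the selector comes back empty is to check whether the class is present at all in the HTML the server returns before any JavaScript runs. The sketch below is only an illustration. The inline HTML string is a made-up stand-in for such a page, not the real ANEEL markup, and `has_selector()` is a hypothetical helper, not part of rvest.

```r
library(rvest)

# Hypothetical helper: TRUE if the CSS selector matches at least one node
# in the already-parsed document.
has_selector <- function(doc, css) {
  length(html_nodes(doc, css)) > 0
}

# Made-up stand-in for a page whose table is filled in later by JavaScript:
# the static HTML only carries an empty container and a script tag.
static_html <- '<html><body><div id="viewer"></div><script src="app.js"></script></body></html>'
doc <- read_html(static_html)

has_selector(doc, ".cv-nwl")  # the class is not in the static markup
has_selector(doc, "#viewer")  # the empty container is
```

If the same check against the real page also finds no match, the cells are probably created in the browser after the page loads, which `read_html()` alone does not reproduce.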

Does this solve it? http://answall.com/q/109475/3635

– Guilherme Nascimento

2017/03/20 at 14:26
This data seems to be inside an Excel spreadsheet. Isn't it possible to download the spreadsheet directly?

– Marcus Nunes

2017/03/20 at 14:39
It is possible; the problem is that there is a form at the top where you specify what you want.

– morebru

2017/03/20 at 21:18
Please explain what you tried and the problems you encountered.

– Tomás Barcellos

2017/03/21 at 20:16
Does this at least help as a starting point? http://answall.com/q/109475/3635 I am voting to close as too broad, because there is not much to work with and no code or attempt has been presented. It is not necessarily a duplicate.

– Guilherme Nascimento

2017/03/21 at 22:05
Wouldn't it be possible to download the .xlsx file and parse it?

– Woss

2017/03/21 at 22:21
Actually, the point is that I would like to get the results for each month of each year. I would then have to download several .xlsx files, since each selection generates a new file.

– morebru

2017/03/23 at 15:48
@GuilhermeNascimento the solution at pt.stackoverflow.com/q/109475/3635 does not help, because the page I would like to scrape behaves like an Excel application. I would have to fill in the form to extract the values. Would you have another suggestion?

– morebru

2018/10/15 at 13:29
Wow, that was a year ago :) ... so if what you want is to read the columns and rows of a spreadsheet, then this package should solve your problem: https://github.com/tidyverse/readxl

– Guilherme Nascimento

2018/10/15 at 14:08
No, actually the Excel file opens at the link I sent earlier as if it were a website.

– morebru

2018/10/15 at 14:33
@morebru that content is populated dynamically via Ajax, so there is nothing to be done there unless you use something like PhantomJS, and that is a lot of work for something that can be solved another way: if you choose File and then download, it generates this link: https://pastebin.com/raw/XFAUYbtZ (the actual link is inside that paste, because it does not fit in a comment)

– Guilherme Nascimento

2018/10/15 at 19:12
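
The suggestion made in the comments (download the .xlsx and parse it, for example with readxl) could be sketched roughly as follows. It assumes, without confirmation from the thread, that the workbook named in the viewer's `id` query parameter is also served directly from the same host. `direct_xlsx_url()` and `read_report()` are hypothetical helpers written for this illustration. The sketch fetches only the workbook as stored, not the per-month, per-year results that the form generates.

```r
# Hypothetical helper: rebuild a direct file URL from the viewer URL by
# combining the scheme and host with the path held in the `id` parameter.
direct_xlsx_url <- function(viewer_url) {
  host <- sub("^(https?://[^/]+).*$", "\\1", viewer_url)
  id   <- sub("^.*[?&]id=([^&]*).*$", "\\1", viewer_url)
  paste0(host, utils::URLdecode(id))
}

# Hypothetical helper: download the workbook to a temporary file and read
# its first sheet. Requires the readxl package and network access.
read_report <- function(viewer_url) {
  tmp <- tempfile(fileext = ".xlsx")
  # mode = "wb" keeps the binary file intact on Windows
  utils::download.file(direct_xlsx_url(viewer_url), tmp, mode = "wb")
  readxl::read_excel(tmp)
}

viewer_url <- "http://relatorios.aneel.gov.br/_layouts/xlviewer.aspx?id=/RelatoriosSAS/RelSAMPClasseConsNivel.xlsx&Source=http%3A%2F%2Frelatorios%2Eaneel%2Egov%2Ebr%2FRelatoriosSAS%2FForms%2FAllItems%2Easpx&DefaultItemOpen=1"

direct_xlsx_url(viewer_url)
# report <- read_report(viewer_url)  # uncomment to attempt the download
```

The download step is kept inside a function and left commented out, so the URL handling can be checked on its own before any request is made.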

No answers
