Programmatically generate links and download content

I would like to know how to collect data from a website.

The site is http://www.ons.org.br/historico/energia_natural_afluente.aspx . I need to download all of the historical operational data there, from power generation through natural inflow energy (Energia Natural Afluente). The problem is that each data series sends you to a page where you select the subsystem (SE/CO, S, NE and N), unit, year, etc. When the options are selected, the page URL does not change, so I cannot tell the result pages apart in order to scrape them automatically.

I want to build a database with all this information. Since I use R a lot, I would like to know how to do this in R.

  • If you only need to download the data and the navigation is simple, you could use Selenium (seleniumhq). It automates browser operations, so it may help.

1 answer

You can do this with the rvest package. The following code should help:

library(rvest)
# create the browsing session
# (note: rvest >= 1.0 renames html_session(), set_values() and submit_form()
#  to session(), html_form_set() and session_submit())
sessao <- html_session("http://www.ons.org.br/historico/energia_natural_afluente.aspx")
# find the form you want to POST
form <- sessao %>% html_form()
form <- form[[4]]
# assign values to the form parameters
values <- set_values(form = form,
                     passo1="SE",
                     passo2a="-1",
                     passo2b="MWmed",
                     passo3a="-1",
                     passo3b="2015",
                     tipo="regiao",
                     passo2="MWmed",
                     passo3="2015",
                     passo4="-1",
                     passo1text="SE",
                     passo2text="MWmed",
                     passo3text="2015",
                     passo4text="-1"
                     )
# submit the form
resposta <- submit_form(sessao, values)
# extract the tables from the form response
tabelas <- resposta %>% html_table(fill = TRUE, header = TRUE)
# pick the table of interest
tabela <- tabelas[[2]]

In the object tabela you will probably find the values you are looking for:

> tabela
        2015
1  Jan 21466
2  Fev 34907
3  Mar 43126
4  Abr 37029
5  Mai 30293
6  Jun 23248
7  Jul 28362
8  Ago 16195
9  Set 21010
10 Out 19459
11 Nov 32269

Now you just need to work out which option values correspond to the data you want and pass them to the set_values function.
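As a rough sketch of that mapping step, the block below builds one set of form values per subsystem and year. Only "SE", "MWmed" and "2015" come from the code above; the other subsystem codes ("S", "NE", "N"), the year range, and the helper name build_values are assumptions for illustration and would need to be checked against the actual form.

```r
# Assumed option values: only "SE" and "2015" are confirmed by the answer's code
subsystems <- c("SE", "S", "NE", "N")
years <- as.character(2013:2015)

# One row per (subsystem, year) combination to request
combos <- expand.grid(passo1 = subsystems, passo3 = years,
                      stringsAsFactors = FALSE)

# Hypothetical helper: returns the named list of form values for one
# combination, mirroring the fields passed to set_values() in the answer
build_values <- function(subsystem, year, unit = "MWmed") {
  list(passo1 = subsystem, passo2a = "-1", passo2b = unit,
       passo3a = "-1", passo3b = year, tipo = "regiao",
       passo2 = unit, passo3 = year, passo4 = "-1",
       passo1text = subsystem, passo2text = unit,
       passo3text = year, passo4text = "-1")
}

# List with one element per request
all_values <- lapply(seq_len(nrow(combos)), function(i) {
  build_values(combos$passo1[i], combos$passo3[i])
})
```

Each element could then be passed to the form with something like do.call(set_values, c(list(form = form), all_values[[i]])) before calling submit_form(sessao, ...), ideally with a short pause between requests.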

  • Perfect. Thank you very much. I don’t know how to work with HTML yet, but I will study the package.

  • When using the submit_form function, what would x be in your example? When I run the code you posted I get errors, because x is not defined.

  • Oops, it’s sessao. I’ll edit the answer.

  • Just one thing, Daniel Falbel: inside set_values, the mapping follows the pattern id_objeto_html=valor_desejado (HTML object id = desired value), is that right?

  • @Gustavocinque, I’m not sure whether it’s the id or the name. What I did was check in the browser what is sent to the server when the form is submitted. In Chrome, I leave the developer tools open (F12) and submit the form. Then, in the Network tab, I look at the form data of the first request, energia_natural_afluente_out.aspx.

  • Got it. It’s really a pretty cool solution. Much less complicated to work with than a web service in Java.

  • Yes! I think R and Python are both making a lot of progress on making web scraping easy. The rvest package that I used here is inspired by Python’s famous Beautiful Soup.

  • @Daniel Falbel I tried to run exactly the code you posted above, but it did not work. The final table comes out empty, with zero rows:
    > tabela
    [1] Fale Conosco Mapa do Site Links Úteis
    <0 linhas> (ou row.names de comprimento 0)
    I tried changing the steps, etc., but without success.

  • @Brunomoreno Very strange, I just ran it again and got the same result as in the answer. Was there no error along the way?

  • @Daniel Falbel I don’t understand it at all. I use RStudio, but I also have plain R on my computer. When I run the code in RStudio I get the problem I described, but when I run it in plain R there is no error. Do you know what it could be?
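On the id-versus-name question raised in the comments above: HTML forms submit their fields under the name attribute, so those are the keys to look for. As a small offline sketch, the field names of a form can be listed with rvest itself. The markup below is made up for illustration (the ids txtAno and cmbRegiao are hypothetical) and is not copied from the ONS page.

```r
library(rvest)

# Made-up form for illustration; the real page's markup may differ
html <- '<html><body>
<form action="energia_natural_afluente_out.aspx" method="post">
  <input type="text" id="txtAno" name="passo3b" value="2015">
  <select id="cmbRegiao" name="passo1">
    <option value="SE">SE/CO</option>
  </select>
</form>
</body></html>'

# Parse the page and take its first (and only) form
form <- html_form(read_html(html))[[1]]

# Collect the name attribute of every field in the form
field_names <- vapply(form$fields, function(f) f$name, character(1),
                      USE.NAMES = FALSE)
print(field_names)
```

Against the real site, the same extraction could be applied to the form selected in the answer (form[[4]]) to see which keys set_values accepts.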
