Programmatically generate and download links

5

Brazil's National Water Agency (ANA) maintains a database that can be accessed through Hidroweb.

To download data, go to:

Hydrological Data > Historical Series

and enter the code of the desired rain station.

Once the station code and type are chosen, the site forwards to a page where the data can be downloaded in MS Access or text format.

Note that this link is the same for every station; only the Codigo="" part changes, which holds the code of the station to be downloaded.

Since I have hundreds of stations to download, doing it one by one is very time-consuming. I would like to run the downloads in a loop.

However, the link mentioned above cannot be passed to download.file(), because it does not lead directly to the file. The actual download link is generated only after choosing the file type (MS Access or .txt), when a "Click here" link appears. The generated link looks like http://hidroweb.ana.gov.br/ARQ/A20150425-173906-352/CHUVAS.ZIP, where A20150425-173906 is the date and time at which the link was accessed (I don't know what -352 means).

Does anyone know how I could do this download with R code?

  • 1

    Bruno, where do you have the station codes? With Python you can easily write a script to run this "crawler" and download all the information you need.

  • Hi Arthur, I would supply the stations as a vector containing the codes I need, so it depends on the study area. For example, these station codes: 2851050, 2751025, 2849035, 2750004, 2650032, 2850015.

  • Bruno, I am also looking for an R function to open the zip files generated by the hydroweb. Did you get it? Thank you. Mauricio Camargo

4 answers

6


Here is an answer in R. You will need the packages httr and XML:

install.packages("httr")
install.packages("XML")

I kept the code simple, without defining functions or taking any parameter other than the station code, but it should be easy to build the rest from this. As in Arthur Alvim's reply, each file is saved under its station code in the current R working directory.

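library(httr)
library(XML)

# URL pieces: the station code goes between the two parts
baseurl <-c("http://hidroweb.ana.gov.br/Estacao.asp?Codigo=", "&CriaArq=true&TipoArq=1")

# Station codes to download
estacoes <- c(2851050, 2751025, 2849035, 2750004, 2650032, 2850015, 123)

for (est in estacoes){
  # Request the page that generates the file; cboTipoReg = "10" selects rainfall data
  r <- POST(url = paste0(baseurl[1], est, baseurl[2]), body = list(cboTipoReg = "10"), encode = "form")
  if (r$status_code == 200) {
    cont <- content(r, as = "text")
    # Extract the path of the generated file, e.g. ARQ/A20150910-005606-786/CHUVAS.ZIP
    arquivo <- unlist(regmatches(cont, gregexpr("ARQ.+/CHUVAS.ZIP", cont)))
    if (length(arquivo) == 0) {
      # No download link on the page (e.g. the station does not exist)
      cat("Erro no arquivo", est, "\n")
      next
    }
    arq.url <- paste0("http://hidroweb.ana.gov.br/", arquivo[1])
    download.file(arq.url, paste0(est, ".zip"), mode = "wb")
    cat("Arquivo", est, "salvo com sucesso.\n")
  } else {
    cat("Erro no arquivo", est, "\n")
  }
}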

# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005606-786/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6532 bytes
# downloaded 6532 bytes
# 
# Arquivo 2851050 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005607-172/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6734 bytes
# downloaded 6734 bytes
# 
# Arquivo 2751025 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005608-703/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6737 bytes
# downloaded 6737 bytes
# 
# Arquivo 2849035 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005609-783/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 3995 bytes
# downloaded 3995 bytes
# 
# Arquivo 2750004 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005610-492/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 10751 bytes (10 KB)
# downloaded 10 KB
# 
# Arquivo 2650032 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005613-538/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 4625 bytes
# downloaded 4625 bytes
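
The key step in the loop above is pulling the generated file path out of the returned HTML. As a minimal sketch of just that step in Python, run against a made-up HTML fragment rather than the live site, the function below uses a slightly stricter, non-greedy variant of the pattern so that only the first link is captured, and it returns `None` when the page has no link. The name `extract_zip_path` and the sample markup are assumptions for illustration.

```python
import re

# Path shape seen in the generated links, e.g. ARQ/A20150910-005606-786/CHUVAS.ZIP.
# The non-greedy ".+?" stops at the first "/CHUVAS.ZIP" instead of the last one.
ZIP_PATH = re.compile(r"ARQ/.+?/CHUVAS\.ZIP")


def extract_zip_path(html):
    """Return the first generated ZIP path found in html, or None if absent."""
    match = ZIP_PATH.search(html)
    return match.group(0) if match else None


# Hypothetical page fragment, not copied from the real site.
sample = '<a href="ARQ/A20150910-005606-786/CHUVAS.ZIP">Clique aqui</a>'
print(extract_zip_path(sample))
print(extract_zip_path("<p>no link here</p>"))
```

Returning `None` instead of an empty match makes it easy to skip a station whose page contains no download link.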
  • 1

    Thank you very much. Looks robust.

  • It worked! Very well!

  • @Molx, when I ran the code again I got the message: No encoding supplied: defaulting to UTF-8. From what I can tell, it comes from r <- POST(url = paste0(baseurl[1], est, baseurl[2]), body = list(cboTipoReg = "10"), encode = "form"). Any idea what might be going on?

  • I'm trying to look into the problem, but the Hidroweb system seems to be crashing, and I saw on the website that they are moving to a new platform. I suggest you go ahead and start checking how to get the data from the new system, because this script will soon become useless anyway.

  • I agree, and I'm still thinking about how to migrate to the new system. As for the question, I think it really is something in the server's response. I got around the problem by adding the argument encoding = "ISO-8859-1" to content(r, as = "text"). I found this at https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html. Thank you very much.
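
The message discussed in this thread comes down to how the response bytes are turned into text. As a rough Python illustration of the same idea, assuming the pages are ISO-8859-1 as the workaround suggests (the byte string below is invented for the example, not taken from the server), such text fails to decode as UTF-8 but decodes cleanly once the encoding is named explicitly:

```python
# Invented example bytes: "Estação" encoded as ISO-8859-1 (Latin-1).
raw = "Estação".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The Latin-1 bytes for "çã" do not form a valid UTF-8 sequence.
    utf8_ok = False

# Naming the encoding explicitly recovers the original text.
text = raw.decode("iso-8859-1")
print(utf8_ok)
print(text)
```

With the `requests` library used in the Python answer, the analogous step would be setting `r.encoding = 'iso-8859-1'` on the response before reading `r.text`.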

3

I implemented something in Python that may help you. It downloads each file and names it after the station code. This does not answer the question, since you asked for R. It would help if you provided an example, so that someone might solve your problem in R.

# hidroweb.py
# -*- coding: utf-8 -*-

# pip install beautifulsoup4
# pip install requests

import requests
import re
import shutil
from bs4 import BeautifulSoup


class Hidroweb(object):

    url_estacao = 'http://hidroweb.ana.gov.br/Estacao.asp?Codigo={0}&CriaArq=true&TipoArq={1}'
    url_arquivo = 'http://hidroweb.ana.gov.br/{0}'

    def __init__(self, estacoes):
        self.estacoes = estacoes

    def montar_url_estacao(self, estacao, tipo=1):
        return self.url_estacao.format(estacao, tipo)

    def montar_url_arquivo(self, caminho):
        return self.url_arquivo.format(caminho)

    def montar_nome_arquivo(self, estacao):
        return u'{0}.zip'.format(estacao)

    def salvar_arquivo_texto(self, estacao, link):
        # Stream the generated ZIP to disk as <station>.zip
        r = requests.get(self.montar_url_arquivo(link), stream=True)
        if r.status_code == 200:
            with open(self.montar_nome_arquivo(estacao), 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            print('** %s ** (baixado)' % (estacao, ))
        else:
            print('** %s ** (problema)' % (estacao, ))

    def obter_link_arquivo(self, response):
        # The download link is the first anchor whose href starts with "ARQ/"
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('a', href=re.compile('^ARQ/'))['href']

    def executar(self):
        # cboTipoReg = 10 selects rainfall data
        post_data = {'cboTipoReg': '10'}

        for est in self.estacoes:
            print('** %s **' % (est, ))
            r = requests.post(self.montar_url_estacao(est), data=post_data)
            link = self.obter_link_arquivo(r)
            self.salvar_arquivo_texto(est, link)
            print('** %s ** (concluído)' % (est, ))

if __name__ == '__main__':
    estacoes = ['2851050', '2751025', '2849035', '2750004', '2650032',
                '2850015', ]
    hid = Hidroweb(estacoes)
    hid.executar()

# output
# ** 2851050 **
# ** 2851050 ** (baixado)
# ** 2851050 ** (concluído)
# ** 2751025 **
# ** 2751025 ** (baixado)
# ** 2751025 ** (concluído)
# ** 2849035 **
# ** 2849035 ** (baixado)
# ** 2849035 ** (concluído)
# ** 2750004 **
# ** 2750004 ** (baixado)
# ** 2750004 ** (concluído)
# ** 2650032 **
# ** 2650032 ** (baixado)
# ** 2650032 ** (concluído)
# ** 2850015 **
# ** 2850015 ** (baixado)
# ** 2850015 ** (concluído)

https://gist.github.com/arthuralvim/0779dda52e6d56d0d3eb

  • Well, that settles it. Thank you very much Arthur.

2

I would like to add that in both versions, R and Python, you need to make a small change to download flow data or other data types.

In R, change this:

  list(cboTipoReg = "10")

In Python:

  post_data = {'cboTipoReg': '10'}

The point is that the hardcoded 10 is only for rainfall (CHUVAS.ZIP). For other data, use the values below:

value="8" for water levels (cm)

value="9" for flow rates (m³/s)

value="12" for water quality

value="13" for discharge summary

value="16" for cross-section profile
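
As a small sketch of how these values could be kept in one place in the Python version, the lookup table below maps a readable name to the `cboTipoReg` value. The key names and the helper `build_post_data` are labels chosen here for illustration, not part of either script; only the numeric values come from the list above.

```python
# cboTipoReg values from the list above; the key names are hypothetical labels.
TIPO_REG = {
    "water_level": "8",        # (cm)
    "flow": "9",               # (m³/s)
    "rainfall": "10",
    "water_quality": "12",
    "discharge_summary": "13",
    "cross_section": "16",
}


def build_post_data(kind):
    """Return the form payload for the given data type; raise on unknown names."""
    try:
        return {"cboTipoReg": TIPO_REG[kind]}
    except KeyError:
        raise ValueError("unknown data type: %r" % (kind,))


print(build_post_data("flow"))
```

The result can be passed as `data=` to `requests.post`. Any code that searches the page specifically for CHUVAS.ZIP would presumably also need adjusting, since the file names for the other data types are not given here.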

1

I added a few steps to Arthur's script, such as changing the working directory and reading the stations from an external file. Apologies for any gaffes, I'm a beginner. I hope this helps.

# -*- coding: utf-8 -*-
"""
___________________________-> Python - Hidroweb <-______________________________

Author: Arthur Alvim 25/04/2015
[email protected]

Modification: Jean Favaretto 16/07/2015
[email protected]

Modification: Vitor Gustavo Geller 16/07/2015
[email protected]

______________________________-> Comments <-____________________________________

The Python HidroWeb script was created to automate the acquisition of
station data from the portal: http://hidroweb.ana.gov.br/

To use the script, the following libraries must be installed:
-> requests
-> beautifulsoup4 (or later)

USAGE:

After installing the libraries, create an Input File with the station
numbers. Then start the script; it opens a window for selecting the
Input File. As output, HidroWeb - Python returns two things. The first,
on screen, is the download status. The second is the set of files, written
to the same directory as the Input File, one for each station whose
transfer succeeded.


INPUT FILE:

The input must be a *.txt file containing the numbers of the stations to
download, with the following structure:
-> Station numbers must be entered one per line,
with no headers, no spaces, and no separators (, . ;).
-> Simply press Enter after each station number.

02751025
02849035
02750004
02650032
02850015


OUTPUTS:

Transfer status on screen:
** 02851050 **
** 02851050 ** (baixado)
** 02851050 ** (concluído)

The output files, containing the information available for each downloaded
station, are created in the directory of the Input File.

NOTE: Make sure that every station number exists; otherwise the script fails
with a "BuuuG".
Keywords: HidroWeb, ANA, stations, rainfall, streamflow, precipitation,
flow, water levels, download.
"""

# ********  INITIAL DECLARATIONS
import os
import sys
import requests
import re
import shutil
from bs4 import BeautifulSoup

try:  # Python 3
    import tkinter as Tkinter
    from tkinter import filedialog as tkFileDialog
except ImportError:  # Python 2
    import Tkinter, tkFileDialog
    input = raw_input

# By Vitor

# OPEN THE INPUT FILE
root    = Tkinter.Tk()
entrada = tkFileDialog.askopenfile(mode='r')
root.destroy()

#****************---------------bug fix--------------********************
if entrada is None:
    sair = input('\tArquivo de entrada nao selecionado. \n\t\tPressione enter para sair.\n')
    sys.exit()
#****************---------------end of bug fix--------------********************

pathname = os.path.dirname(entrada.name)  # working directory = directory of the input file
os.chdir(pathname)  # change the working directory

VALORES = []

# By Jean

# Read the whole file at once, dropping blank lines (e.g. a trailing newline)
conteudo_linha = [linha.strip() for linha in entrada.read().split("\n") if linha.strip()]
VALORES.append(conteudo_linha)
entrada.close()

print(VALORES)


#### By Arthur

class Hidroweb(object):

    url_estacao = 'http://hidroweb.ana.gov.br/Estacao.asp?Codigo={0}&CriaArq=true&TipoArq={1}'
    url_arquivo = 'http://hidroweb.ana.gov.br/{0}'

    def __init__(self, estacoes):
        self.estacoes = estacoes

    def montar_url_estacao(self, estacao, tipo=1):
        return self.url_estacao.format(estacao, tipo)

    def montar_url_arquivo(self, caminho):
        return self.url_arquivo.format(caminho)

    def montar_nome_arquivo(self, estacao):
        return u'{0}.zip'.format(estacao)

    def salvar_arquivo_texto(self, estacao, link):
        # Stream the generated ZIP to disk as <station>.zip
        r = requests.get(self.montar_url_arquivo(link), stream=True)
        if r.status_code == 200:
            with open(self.montar_nome_arquivo(estacao), 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            print('** %s ** (baixado)' % (estacao, ))
        else:
            print('** %s ** (problema)' % (estacao, ))

    def obter_link_arquivo(self, response):
        # The download link is the first anchor whose href starts with "ARQ/"
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('a', href=re.compile('^ARQ/'))['href']

    def executar(self):
        # cboTipoReg = 10 selects rainfall data
        post_data = {'cboTipoReg': '10'}

        for est in self.estacoes:
            print('** %s **' % (est, ))
            r = requests.post(self.montar_url_estacao(est), data=post_data)
            link = self.obter_link_arquivo(r)
            self.salvar_arquivo_texto(est, link)
            print('** %s ** (concluído)' % (est, ))

if __name__ == '__main__':
    estacoes = VALORES[::1][0]
    hid = Hidroweb(estacoes)
    hid.executar()
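
As a standalone sketch of reading the input file format described in the script's header (one station code per line), the helper below skips blank lines and rejects anything that is not purely digits. The name `read_station_codes` and the digits-only check are assumptions for illustration, not part of the script.

```python
import io


def read_station_codes(handle):
    """Return station codes from a file-like object, one per line, skipping blanks."""
    codes = []
    for line in handle:
        code = line.strip()
        if not code:
            continue  # ignore empty lines, e.g. one left by a trailing Enter
        if not code.isdigit():
            raise ValueError("not a station code: %r" % (code,))
        codes.append(code)
    return codes


# In-memory stand-in for the input file.
sample = io.StringIO("02751025\n02849035\n\n02750004\n")
print(read_station_codes(sample))
```

Codes are kept as strings so that leading zeros such as in 02751025 are preserved.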
