Web Scraping: How to change the value of a drop down button on a site using R?

I want to write an R script that reads an HTML table. Doing this for a static page with the rvest package is easy; the problem is that I first have to change the value of two drop-down lists on the page.

On the site, above the graph, there are two drop-down lists: one for the state (ctl00$cphConteudo$CpDDLEstado) and the other for an agricultural product (ctl00$cphConteudo$CpDDLProduto).

I tried the following code unsuccessfully:

library(rvest)
url <- "http://www.agrolink.com.br/cotacoes/historico/rs/leite-1l"
pgsession <- html_session(url)               ## create session
pgform    <- html_form(pgsession)[[1]]       ## pull form from session
filled_form <- set_values(pgform,
                          `ctl00$cphConteudo$CpDDLEstado` = "9826", #bahia
                          `ctl00$cphConteudo$CpDDLProduto` = "17") # algodão

submit_form(pgsession,filled_form)

The code returns a link to a blank page.

1 answer

This site handles POST requests in a rather awkward way, but it has the advantage of also accepting GET requests. For GET, it uses the format:

http://www.agrolink.com.br/cotacoes/historico/#ESTADO/#NOME_PRODUTO

After some testing, I saw that #ESTADO is always the state abbreviation in lowercase. For the product name, every character that is not alphanumeric is replaced with -.

So you could convert the product names with code like this:

library(stringr)
produtos <- c("Banana Prata Anã Primeira Atacado Cx 20Kg",
              "Cebola Amarela (Ipa) Produtor 1Kg",
              "Açúcar VHP Sc 50Kg")

# punctuation and whitespace become "-", then lowercase and strip accents
produtos <- produtos %>%
  str_replace_all("[[:punct:]]", "-") %>%
  str_replace_all("[[:space:]]", "-") %>%
  tolower() %>%
  iconv(to = "ASCII//TRANSLIT")
produtos
[1] "banana-prata-ana-primeira-atacado-cx-20kg" "cebola-amarela--ipa--produtor-1kg"
[3] "acucar-vhp-sc-50kg"
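
The same conversion can also be wrapped in a reusable function. The sketch below is only one possible version: the name slug_produto is made up for illustration, and it swaps iconv() for stringi::stri_trans_general() with the "Latin-ASCII" transliterator, on the assumption that this behaves more consistently across platforms than ASCII//TRANSLIT. It also assumes that the replacement rule is exactly "anything that is not a letter or digit becomes -", as observed above.

```r
library(stringi)

# Hypothetical helper (not part of any package): converts a product name
# into the slug format observed in the site's URLs.
slug_produto <- function(nome) {
  nome <- stri_trans_general(nome, "Latin-ASCII")  # strip accents, e.g. "ã" -> "a"
  nome <- tolower(nome)
  gsub("[^a-z0-9]", "-", nome)                     # every other character becomes "-"
}

slug_produto("Cebola Amarela (Ipa) Produtor 1Kg")
```

Because every punctuation mark and space is replaced one-for-one, names with parentheses still produce doubled dashes, exactly as in the pipeline above.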

Then you can request each page with a loop that goes through the vectors of states and products:

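library(rvest)

estados <- c("sp", "mg")
tabelas <- list()  # one table per state/product pair
for (estado in estados) {
  for (produto in produtos) {
    url <- sprintf("http://www.agrolink.com.br/cotacoes/historico/%s/%s", estado, produto)
    tabela <- read_html(url) %>%
      html_nodes("#ctl00_cphConteudo_gvHistorico") %>%
      html_table()
    # keep each result instead of overwriting it on the next iteration
    tabelas[[paste(estado, produto, sep = "/")]] <- tabela[[1]]
  }
}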
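
A variation on the loop above is to build all the URLs first, so they can be inspected before any request is sent, and to skip pages that fail instead of stopping the whole run. This is only a sketch: montar_urls and ler_tabela are hypothetical names, the rvest package is assumed to be installed, and returning NULL for a missing table is a design choice of this example rather than something the site documents.

```r
# Hypothetical helper: one row per state/product combination, with its URL.
montar_urls <- function(estados, produtos) {
  combos <- expand.grid(estado = estados, produto = produtos,
                        stringsAsFactors = FALSE)
  combos$url <- sprintf("http://www.agrolink.com.br/cotacoes/historico/%s/%s",
                        combos$estado, combos$produto)
  combos
}

# Hypothetical helper: read the history table from one page.
# Returns NULL if the request fails or the table is not on the page.
ler_tabela <- function(url) {
  tryCatch({
    nos <- rvest::html_nodes(rvest::read_html(url),
                             "#ctl00_cphConteudo_gvHistorico")
    if (length(nos) == 0) NULL else rvest::html_table(nos)[[1]]
  }, error = function(e) NULL)
}

urls <- montar_urls(c("sp", "mg"), c("leite-1l", "acucar-vhp-sc-50kg"))
# tabelas <- lapply(urls$url, ler_tabela)  # sends one request per row
```

The last line is left commented out so that the URLs in urls can be checked by eye before the site is contacted.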

Of course, with this approach you still need to build a list of product names and a list of state abbreviations, but I believe it is the easiest way.

  • Very good! It reminded me that I need to study regular expressions.
