Download data from Stock Exchange tables in R

Asked

Viewed 384 times

2

I have the following code, I need to download the data that is in the table, but the dataframe is always returning empty.

library(tidyverse)
library(rvest)
library(bizdays)
library(dplyr)
library(lubridate)

#### Carregar dados ####

url <- 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/'

html <- read_html(url) 

dados <- html %>%
  html_nodes(xpath = "//table//td") %>%
  html_text() %>%
  str_replace(",", ".") %>%
  as.numeric() %>%
  matrix(ncol = 3, byrow = TRUE) %>%
  as.data.frame()
  • 1

    I’m guessing the B3 website does not allow you to download directly via webscrap. I tried several ways (including with the "rvest" package) and always returns empty. I recommend exploring and trying to download information related to B3 by the "Batchgetsymbols" package. Or copy the table to Excel and read by R.

  • 1

    Thanks for the tips!

  • 1

    @Alexandresanches consider editing the question title to make it more general. Your problem is not limited to the stock exchange site, it occurs on several Urls

2 answers

1


It is easy to notice that there is no error in your code. To see this, just replace your url with a url that has a well formatted table. I use this as an example website. Behold:

library(tidyverse)
library(rvest)
library(bizdays)
library(dplyr)
library(lubridate)


japan<- "http://www.data.jma.go.jp/obd/stats/etrn/view/monthly_s3_en.php?block_no=47401&view=1"
html <- read_html(japan) 

dados_japan <- html %>%
  html_nodes(xpath = "//table//td") %>%
  html_text() %>%
  str_replace(",", ".") %>%
  as.numeric() %>%
  matrix(ncol = 14, byrow = TRUE) %>%
  as.data.frame() %>% setNames(c('Year','Jan','Feb','Mar','Apr',
                                 'May','Jun','Jul','Aug','Sep',
                                 'Oct','Nov','Dec','Annual'))

head(dados_japan)

Output:

  Year Jan  Feb  Mar  Apr  May Jun Jul  Aug  Sep  Oct  Nov  Dec Annual
1   NA  NA 1938 -5.2 -4.9 -0.6 4.7 9.5 11.6 17.9 22.2 16.5 10.7    3.3
2 -4.7 6.8 1939 -7.5 -6.6 -1.4 4.0 7.5 13.0 17.4 20.0 17.4  9.7    3.0
3 -2.5 6.2 1940 -6.0 -5.7 -0.5 3.5 8.5 11.0 16.6 19.7 15.6 10.4    3.7
4 -1.0 6.3 1941 -6.5 -5.8 -2.6 3.6 8.1 11.4 12.7 16.5 16.0 10.0    4.0
5 -2.9 5.4 1942 -7.8 -8.2 -0.8 3.5 7.1 12.0 17.4 18.4 15.7 10.5    2.5
6 -2.9 5.6 1943 -4.1 -6.1 -1.1 3.5 6.9 12.9 19.3 21.5 17.5 11.7    1.2

If the code is right, why the failure to perform the same procedure on the Bovespa website? I believe this is due to the formatting in the source code of the page. Note that if you open the source code of the page of the Japanese Weather Agency used in the example (Crtl+U in google Chrome or BD->View page source), click Ctrl+F and look for the term "table" you will notice that the content of the table of interest is wrapped in a tag "table class". However, when you perform the same procedure on the Ibovespa page there is no tag equivalent. Behind the scenes I believe the function read_html looking for that tag table, if it does not find, the search returns empty and the code fails.

What to do?

Well, in that case you will need to manually find the table and only then use the function read_html with the correct URL inputs and xpath. For this just follow the following steps:

  1. Click with the BD on top of the table
  2. View frame source
  3. Ctrl+F "table". She will be there
  4. Copy the url (without "view-source:") and replace in your code
  5. Go back to the original page
  6. Take the xpath table (BD->Inspect, selects element, BD->copy->xpath)
  7. Put this on xpath in the argument of html_nodes

The code will look like this:

bovespa <- 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp'
html <- read_html(bovespa) 

dados_bov <- html %>%
  html_nodes(xpath = '//*[@id="tb_principal1"]')%>% #aqui você já tem a tabela, daqui para baixo o código apenas padroniza a tabela que vem desestruturada
  html_text()%>%
  str_replace(",", ".") %>%
  str_replace_all('(\r|\n){1,}',';')%>%


dados_bov=strsplit(dados_bov,';')[[1]][5:892]%>% #transforma a string em uma lista e elimina o cabeçalho da base
  matrix(ncol = 3, byrow = TRUE)%>%
  as.data.frame()%>%
  setNames(c("Dias Coriidos", "DI_pre_252", "DI_pre_360"))

head(dados_bov)

Output:

  Dias Coriidos DI_pre_252 DI_pre_360
1             3       4.15       0,00
2             7       4,01       4,10
3            10       4,01       3,43
4            12       4,00       3,81
5            13       4,00       3,95
6            14       4,00       4,08

BD: right click

  • Thank you very much for the excellent explanation!

1

Browser other questions tagged

You are not signed in. Login or sign up in order to post.