There is actually no error in your code. To see this, just replace your URL with one that contains a well-formatted table. I will use this site as an example. Behold:
library(tidyverse)
library(rvest)
library(bizdays)
library(dplyr)
library(lubridate)
japan <- "http://www.data.jma.go.jp/obd/stats/etrn/view/monthly_s3_en.php?block_no=47401&view=1"
html <- read_html(japan)
dados_japan <- html %>%
html_nodes(xpath = "//table//td") %>%
html_text() %>%
str_replace(",", ".") %>%
as.numeric() %>%
matrix(ncol = 14, byrow = TRUE) %>%
as.data.frame() %>% setNames(c('Year','Jan','Feb','Mar','Apr',
'May','Jun','Jul','Aug','Sep',
'Oct','Nov','Dec','Annual'))
head(dados_japan)
Output:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
1 NA NA 1938 -5.2 -4.9 -0.6 4.7 9.5 11.6 17.9 22.2 16.5 10.7 3.3
2 -4.7 6.8 1939 -7.5 -6.6 -1.4 4.0 7.5 13.0 17.4 20.0 17.4 9.7 3.0
3 -2.5 6.2 1940 -6.0 -5.7 -0.5 3.5 8.5 11.0 16.6 19.7 15.6 10.4 3.7
4 -1.0 6.3 1941 -6.5 -5.8 -2.6 3.6 8.1 11.4 12.7 16.5 16.0 10.0 4.0
5 -2.9 5.4 1942 -7.8 -8.2 -0.8 3.5 7.1 12.0 17.4 18.4 15.7 10.5 2.5
6 -2.9 5.6 1943 -4.1 -6.1 -1.1 3.5 6.9 12.9 19.3 21.5 17.5 11.7 1.2
If the code is right, why does the same procedure fail on the Bovespa website? I believe this is due to the formatting of the page's source code. If you open the source of the Japan Meteorological Agency page used in the example (Ctrl+U in Google Chrome, or right-click -> View page source), press Ctrl+F and search for the term "table", you will see that the table of interest is wrapped in a table tag. When you perform the same procedure on the Ibovespa page, there is no equivalent tag. Behind the scenes, read_html parses only the document at the URL you give it; if the table tag is not in that document, the node search returns empty and the code fails.
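You can confirm this failure mode yourself with a minimal sketch (here `your_url` is a placeholder for the original Ibovespa page you were scraping, not the frame URL):

```r
library(rvest)

# When the table lives inside a frame, the top-level document contains no
# <table> element, so the node search returns an empty set ({xml_nodeset (0)})
# and everything downstream of it fails.
html <- read_html(your_url)                 # your_url: placeholder for the original page
html_nodes(html, xpath = "//table//td")     # empty node set
```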
What to do?
Well, in that case you will need to locate the table manually and only then call read_html with the correct URL and xpath arguments. Just follow these steps:
- Right-click on the table
- Choose "View frame source"
- Press Ctrl+F and search for "table"; the table will be there
- Copy the URL (without the "view-source:" prefix) and replace it in your code
- Go back to the original page
- Get the table's xpath (right-click -> Inspect, select the element, right-click -> Copy -> Copy XPath)
- Pass it to the xpath argument of html_nodes
The code will look like this:
bovespa <- 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp'
html <- read_html(bovespa)
dados_bov <- html %>%
  html_nodes(xpath = '//*[@id="tb_principal1"]') %>% # here you already have the table; the rest of the code just tidies the unstructured text
  html_text() %>%
  str_replace_all(",", ".") %>%  # str_replace_all, not str_replace, so every decimal comma is converted
  str_replace_all('(\r|\n){1,}', ';')
dados_bov <- strsplit(dados_bov, ';')[[1]][5:892] %>% # turns the string into a vector and drops the table header
  matrix(ncol = 3, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("Dias_Corridos", "DI_pre_252", "DI_pre_360"))
head(dados_bov)
Output:
  Dias_Corridos DI_pre_252 DI_pre_360
1             3       4.15       0.00
2             7       4.01       4.10
3            10       4.01       3.43
4            12       4.00       3.81
5            13       4.00       3.95
6            14       4.00       4.08
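Note that after setNames the three columns are still character. A short follow-up sketch (assuming the decimal commas have already been replaced with dots; otherwise as.numeric would produce NA) converts them to numbers:

```r
# The parsed columns are character; convert them to numeric for analysis.
# Assumes decimal commas were already replaced with dots upstream.
dados_bov[] <- lapply(dados_bov, as.numeric)
str(dados_bov)
```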
I'm guessing the B3 website does not allow you to download data directly via web scraping. I tried several approaches (including the "rvest" package) and it always returns empty. I recommend exploring the "BatchGetSymbols" package to download B3-related data, or copying the table into Excel and reading it into R from there.
– Guilherme Marques
Thanks for the tips!
– Alexandre Sanches
@Alexandresanches consider editing the question title to make it more general. Your problem is not limited to the stock-exchange site; it occurs on several URLs
– Lucas