Grab URL table from R

Question

Grab URL table from R

Asked 7 years, 8 months ago

Viewed 758 times

4

I need to do it here:

library(xml)

URL <- "http://globoesporte.globo.com/futebol/brasileirao-serie-a/"

tabela1 <- readHTMLTable(URL, which = 1, colClasses = )

tabela1$V3 <- NULL

names(tabela1) <- c("Posição","Time")

tabela2 <- readHTMLTable(URL, which = 2)

tabela2$`ÚLT. JOGOS` <- NULL

df.data <- bind_cols(tabela1, tabela2)

df.data$Time <- gsub("[A-Z]{3}$","",df.data$Time)

But without using the XML library, because the application I use does not support.

1

The problem is not clear to me. I understood that it is not possible to use the package XML, but this means that none other package can be used? That is, the problem should be solved with the base commands of R? Or is it possible to install other webscraping packages, such as rvest, for example?

– Marcus Nunes

2017/11/09 at 20:29
Briefly, I need to extract table from a URL in Rstudio, I can only use existing packages in R, but not using XML.

– Juny

2017/11/09 at 20:44

1 answer

Browser other questions tagged html r url

You are not signed in. Login or sign up in order to post.

by Tomás Barcellos • **5,562** points · Answer 1 · 2017-11-10T12:19:13+00:00

As commented by @Marcusnunes, it is possible to do this extraction using the rvest:

library(rvest)
URL <- "http://globoesporte.globo.com/futebol/brasileirao-serie-a/"
tabelas <- read_html(URL) %>% html_table()

tabela1 <- tabelas[[1]]
tabela1[[3]] <- NULL
names(tabela1) <- c("Posição","Time")

tabela2 <- tabelas[[2]]
tabela2$`ÚLT. JOGOS` <- NULL

df.data <- dplyr::bind_cols(tabela1, tabela2)
df.data$Time <- gsub("[A-Z]{3}$","",df.data$Time)

If there is a package use limitation, only with r-base it is possible to create your own html_table2:

pagina <- readLines(URL, encoding = 'UTF-8')

html_table2 <- function(pagina) {
  tabelas <- strsplit(paste(pagina, collapse = "\n"), '<table.*?>')[[1]]
  # remover lixo antes da primeira tabela
  tabelas <- tabelas[-1]

  tabelas_limpas <- lapply(tabelas, function(x) {
    # tirar sujeira posterior a '</table>'
    limpa <- sub('</table>.*', '', x)
  })

  linhas <- lapply(tabelas_limpas, function(x) {
    linhas <- strsplit(x, '<tr.*?>')[[1]]
    sub('</tr.*?>', '', linhas)
  })

  df <- lapply(linhas, function(x) {
    colunas <- lapply(x, function(y) strsplit(y, '<td.*?>')[[1]])
    matriz <- do.call(rbind, colunas)
    limpas <- gsub('</?.+?>', '', matriz) # Remove tags HTML
    limpas <- gsub('\\n', '', limpas)
    limpas <- gsub('[\\s]{2,}', '', limpas)
    # Remove duas primeiras linhas e a primeira e ultima colunas
    limpas <- limpas[- (1:2), -c(1, ncol(limpas))]
    as.data.frame(limpas)
  })
  df
}

tabelas <- html_table2(pagina)
tabela1 <- tabelas[[1]]
names(tabela1) <- c("Posição","Time")

tabela2 <- tabelas[[2]]
names(tabela2) <- c('P', 'J', 'V', 'E', 'D', 'GP', 'GC', 'SG', '%')

df.data <- dplyr::bind_cols(tabela1, tabela2)
df.data$Time <- gsub("[A-Z]{3}$","",df.data$Time)