Web scraping with R

I am trying to scrape the following link: http://empresasdobrasil.com/empresas/alta-floresta-mt/

I want to access all the categories and extract a data frame with the names of all the companies.

If you click on the name of any of the companies you will get some data like:

  • Trade name (nome fantasia)
  • Corporate name (razão social)
  • Opening date
  • Company status
  • Legal nature
  • Address

Besides the names, I would also like to know how to get this information.

I tried to use rvest, but I was unsuccessful.

Any ideas?

  • Friend, you have to create a function to scrape the links from the first layer (the categories) and, for each item in a category, do the same for the company links. I recommend RSelenium, since you can combine the navigation with the lists of links: loop the scrape over each address in the list (see the sketch after these comments). Cheers,

  • I was writing a script to answer your question, but I ran into this: Screenshot. The site does not allow mass access, and any attempt makes you fall into a captcha. Unless you know (or discover) a way to solve it, the only way is to buy the services they offer.

  • Ah, and if you insist, as I was doing here, you’ll end up here: http://empresasdobrasil.com/acessoBloqueado. At that point not even solving the captcha helps.
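
A minimal sketch of the two-layer navigation described in the first comment, using RSelenium. This assumes a local Firefox driven through rsDriver; the selectors a.linhas (categories) and td a (companies) are taken from the answers below and may need adjusting, and the site's captcha will still interfere:

library(RSelenium)

driver <- rsDriver(browser = "firefox")
remDr <- driver$client

remDr$navigate("http://empresasdobrasil.com/empresas/alta-floresta-mt/")

# first layer: the category links
categorias <- remDr$findElements(using = "css selector", "a.linhas")
links_categorias <- unlist(lapply(categorias,
                                  function(e) e$getElementAttribute("href")))

# second layer: visit each category and collect the company links
links_empresas <- c()
for (link in links_categorias) {
  remDr$navigate(link)
  empresas <- remDr$findElements(using = "css selector", "td a")
  links_empresas <- c(links_empresas,
                      unlist(lapply(empresas,
                                    function(e) e$getElementAttribute("href"))))
}

remDr$close()
driver$server$stop()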

3 answers

My access was blocked while I was working on this, but my code is below, with some explanations of how each part works. I only used recent Hadley Wickham packages, including rvest, which you wanted to use.

Unfortunately the scraper is not very useful because of the captcha and the blocking. The site lets you request a quote for the data here. Still, the code below can serve as a basis for other scrapers. I recommend adding nicer error handling, for example with dplyr::failwith or tryCatch, as sketched below.
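
For instance, a hypothetical tryCatch wrapper (just a sketch; baixar_categorias is defined below, and the 'erro' sentinel follows the same convention the functions already use) could look like this:

baixar_categorias_seguro <- function(link) {
  tryCatch(
    baixar_categorias(link),
    # on any error, return the same sentinel data.frame the scraper expects
    error = function(e) data.frame(result = 'erro', stringsAsFactors = FALSE)
  )
}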

library(rvest)
library(httr)
library(tidyr)
library(dplyr)
library(stringr)

#' Has a captcha?
#'
#' Checks whether a response contains a captcha.
#'
#' @param r result of a request (package \code{\link{httr}}).
#'
#' @return \code{TRUE} if there is a captcha, \code{FALSE} otherwise.
tem_captcha <- function(r) {
  res <- r %>%
    content('text') %>%
    read_html() %>%
    html_nodes(xpath = '//form[@action="/verificarCaptcha/confirmar"]') %>%
    length()
  res > 0
}

#' Blocked?
#'
#' Checks whether a response is the "Acesso bloqueado" (access blocked) page.
#'
#' @param r result of a request (package \code{\link{httr}}).
#'
#' @return \code{TRUE} if access was blocked.
bloqueado <- function(r) {
  r %>%
    content('text') %>%
    str_detect('Acesso bloqueado')
}

#' Download categories
#'
#' Downloads the categories from the starting link,
#' e.g. http://empresasdobrasil.com/empresas/alta-floresta-mt/
#'
#' @param link municipality URL.
#'
#' @return \code{data.frame}
baixar_categorias <- function(link) {
  r <- GET(link)
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  u_base <- 'http://empresasdobrasil.com'
  r %>%
    content('text') %>%
    read_html() %>%
    html_nodes('.container a.linhas') %>% {
      data.frame(tipo = html_text(.),
                 link_categoria = paste0(u_base, html_attr(., 'href')),
                 stringsAsFactors = FALSE)
    } %>%
    mutate(result = 'OK')
}

#' Download companies
#'
#' Downloads the companies from a category link,
#' e.g. http://empresasdobrasil.com/empresas/alta-floresta-mt/hoteis
#'
#' @param link category URL.
#'
#' @return \code{data.frame}
baixar_empresas <- function(link) {
  # write_disk() also saves a local copy of the page (handy for debugging)
  r <- GET(link, write_disk('arq.html', overwrite = TRUE))
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  u_base <- 'http://empresasdobrasil.com'
  r %>%
    content('text') %>%
    read_html() %>%
    html_node('table') %>% {
      tab <- html_table(.) %>%
        setNames('nome_razao') %>%
        separate(nome_razao, c('nome_fantasia', 'razao_social'),
                 sep = ' - ', extra = 'merge', fill = 'left')
      links <- html_nodes(., 'a') %>%
        html_attr('href')
      tab$link_empresa <- paste0(u_base, links)
      tab
    } %>%
    mutate(result = 'OK')
}

#' Download a company's info
#'
#' Downloads a company's details from the link on its page.
#'
#' @param link company URL.
#'
#' @return \code{data.frame}
baixar_empresa <- function(link) {
  r <- GET(link)
  if (r$status_code != 200) return(data.frame(result = 'erro'))
  if (tem_captcha(r)) return(data.frame(result = 'captcha'))
  if (bloqueado(r)) return(data.frame(result = 'bloqueado'))
  r %>%
    content('text') %>%
    read_html() %>% {
      data.frame(titulo = html_text(html_nodes(., 'h4')),
                 texto = html_text(html_nodes(., 'h5')),
                 stringsAsFactors = FALSE)
    } %>%
    mutate(result = 'OK')
}

baixar_tudo <- function(link = 'http://empresasdobrasil.com/empresas/alta-floresta-mt/') {
  d <- link %>%
    baixar_categorias() %>%
    group_by(tipo, link_categoria) %>%
    do(baixar_empresas(.$link_categoria)) %>%
    ungroup() %>%
    group_by(tipo, link_categoria, nome_fantasia,
             razao_social, link_empresa) %>%
    do(baixar_empresa(.$link_empresa))
  d
}
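
Hypothetical usage (in practice the captcha and the blocking kick in after a few requests):

d <- baixar_tudo('http://empresasdobrasil.com/empresas/alta-floresta-mt/')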

You only need the XML library to scrape the data.

The code below worked to capture information for the first companies; however, I was then blocked for making too many requests. If you get past that barrier, the code works.

First, I captured the links of all the categories. Then, within each category, the links of each company. Finally, with a for loop, I scrape each company's page, extract the data inside the h5 tags and append it as a new row to an initially empty data frame.

library(XML)

url <- "http://empresasdobrasil.com/empresas/alta-floresta-mt/"
page_source <- xmlRoot(htmlParse(readLines(url)))
links_categorias <- xpathSApply(page_source, "//a[@class = 'linhas']",
                                xmlGetAttr, name = "href")

url_parcial <- "http://empresasdobrasil.com/"
links_empresas <- c()
i <- 1
for (categoria in links_categorias) {
  print(i); i <- i + 1
  url <- paste0(url_parcial, categoria)
  page_source <- xmlRoot(htmlParse(readLines(url)))
  links <- xpathSApply(page_source, "//td/a[@href]", xmlGetAttr, name = "href")
  links_empresas <- c(links_empresas, links)
}

i <- 1
dados <- data.frame()
for (empresa in links_empresas) {
  print(i); i <- i + 1
  url <- paste0(url_parcial, empresa)
  page_source <- xmlRoot(htmlParse(readLines(url)))
  info_empresa <- xpathSApply(page_source, "//h5", xmlValue)
  dados <- rbind(dados, info_empresa)
}
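
One caveat: growing a data frame with rbind() inside a loop is slow and inherits odd column names from the first row. A sketch of a common alternative, with the same logic and assumptions as above, is to accumulate the rows in a list and bind them once at the end:

linhas <- vector("list", length(links_empresas))
for (i in seq_along(links_empresas)) {
  url <- paste0(url_parcial, links_empresas[i])
  page_source <- xmlRoot(htmlParse(readLines(url)))
  # assumes every company page exposes the same number of h5 fields
  linhas[[i]] <- xpathSApply(page_source, "//h5", xmlValue)
}
dados <- as.data.frame(do.call(rbind, linhas), stringsAsFactors = FALSE)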

RSelenium is a good idea, although you don't need to load the pages in a browser; the functions of the XML package (htmlParse, getNodeSet, xmlValue and xmlGetAttr) are enough:

1 - collect all the category links;

2 - collect the company links (this needs a loop over the links from the previous step);

3 - collect the company data (a loop over the links from the previous step).
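
A sketch of those three steps with the XML functions named above. The XPath selectors are borrowed from the other answers, and the same captcha caveats apply:

library(XML)

url_base <- "http://empresasdobrasil.com/"
doc <- htmlParse(readLines(paste0(url_base, "empresas/alta-floresta-mt/")))

# 1 - category links
links_categorias <- sapply(getNodeSet(doc, "//a[@class = 'linhas']"),
                           xmlGetAttr, name = "href")

# 2 - company links, category by category
links_empresas <- unlist(lapply(links_categorias, function(l) {
  doc_cat <- htmlParse(readLines(paste0(url_base, l)))
  sapply(getNodeSet(doc_cat, "//td/a[@href]"), xmlGetAttr, name = "href")
}))

# 3 - company data (the h5 tags, as in the previous answer)
dados <- lapply(links_empresas, function(l) {
  doc_emp <- htmlParse(readLines(paste0(url_base, l)))
  sapply(getNodeSet(doc_emp, "//h5"), xmlValue)
})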
