How to ignore links that do not fit the established conditions and continue with scraping?

I would like to know how to ignore links that do not match the conditions set for titulo, data_hora and texto, so that the script can continue scraping the site.

This is the error that occurs when a link does not contain one of the expected elements: "Error in data.frame(titulo, data_hora, texto): arguments imply differing number of rows: 1, 0"
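To illustrate (with hypothetical values), the error comes from data.frame() receiving vectors of different lengths when one of the XPath queries matches nothing on a page:

titulo    <- "Some headline"   # length 1
data_hora <- character(0)      # length 0: the node was not found on this page
texto     <- "Some text"       # length 1
data.frame(titulo, data_hora, texto)
# Error in data.frame(titulo, data_hora, texto) :
#   arguments imply differing number of rows: 1, 0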

Below is the script:

# load libraries
library(XML)
library(xlsx)

# search URL; "koxa" is a placeholder for the page number
url_base <- "http://www.saocarlosagora.com.br/busca/?q=bolt&page=koxa"
url_base <- gsub("bolt", "PDT", url_base)

# collect the result links from the first 4 pages of the search
links_saocarlos <- c()
for (i in 1:4){
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//div[@class='item']/a", xmlGetAttr, name = "href")
  links <- paste("http://www.saocarlosagora.com.br/", links, sep = "")
  links_saocarlos <- c(links_saocarlos, links)
}

# visit each link and extract the title, date/time and body text
dados <- data.frame()
for (links in links_saocarlos){
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo    <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto     <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

agregar <- aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = ' ')

1 answer

In your case, I think an if already solves it. For example, replace the line that adds to the data frame with:

if (length(titulo) == 1 & length(data_hora) == 1 & length(texto) == 1){
    dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

That is: only add the new row if all of its elements exist.
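For reference, a compact equivalent of this check (a sketch, assuming base R >= 3.2.0, which provides lengths()):

if (all(lengths(list(titulo, data_hora, texto)) == 1)){
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}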

However, you could do your scraping in a more robust way as follows:

library(plyr)

raspar <- failwith(NULL, function(links){
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)

  titulo    <- xpathSApply(pag1, "//div[@class='row-fluid row-margin']/h2", xmlValue)
  data_hora <- xpathSApply(pag1, "//div[@class='horarios']", xmlValue)
  texto     <- xpathSApply(pag1, "//div[@id='HOTWordsTxt']/p", xmlValue)

  data.frame(titulo, data_hora, texto)
})

dados <- ldply(links_saocarlos, raspar)

The failwith function captures errors without stopping execution. This is very useful in web scraping, where connection problems are common and can cause unexpected errors in the code.
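A minimal sketch of that behaviour (hypothetical function, assuming plyr is loaded): the wrapped function returns the default value instead of raising the error.

seguro <- failwith(NULL, function(url) stop("connection error"))
seguro("http://example.com")   # returns NULL instead of aborting the loop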

Also, using plyr (the ldply function) has some advantages over your for loop. The main one is that you don't grow the object dynamically, which is usually much faster. Another advantage is that you can pass the argument .progress = "text" to get a progress bar in your code :)

dados <- ldply(links_saocarlos, raspar, .progress = "text")
  • Thanks a lot for the help, Daniel.
