Using the code below, why am I only collecting data from the last page of the Loop?


rm(list=ls())
options(warn=-1)
library("RCurl")
library("XML")

baseurl <- "http://www.gmbahia.ufba.br/index.php/gmbahia/issue/archive?issuesPage=XX#issues"
dados <- data.frame()

for (i in 1:29){
      print(i)
      url <- gsub("XX", i, baseurl)
      url <- xmlRoot(htmlParse(readLines(url)))
      links <- getNodeSet(url, "//a")}

## Thesis links
links.teses <- xmlSApply(links, xmlGetAttr, name = "href")
links.teses <- grep("view", links.teses, value = T)
links.teses

## Edition names
teses.titulos <- xmlSApply(links, xmlValue)
teses.titulos <- grep("de", teses.titulos, value = T)
teses.titulos

dados <- rbind(teses.titulos, links.teses)} 
View(dados)
  • I formatted your code and added the R tag, since you hadn't (I'm waiting for a reviewer to approve). Also, review the code you posted: the for loop opens but never closes. Take the opportunity to describe your problem better, perhaps explaining a bit of the algorithm. That will help your question get more attention from the community. Showing effort and care when formatting a question goes a long way. :-)

  • Thanks Cantoni. Even after closing the for loop, I still only get the last pages.

  • I think I already have the answer to your question; I'm testing it now. Please edit the post to show where the for starts and ends. I worked it out here, but it's important that it be documented. Also add a brief description of what you're trying to do; posting code like this without any explanation isn't helpful. I'll wait before posting the answer.

1 answer

See the suggested code below; it now works. The loop needed to be closed (as @Cantoni had already pointed out), but the problem you reported came from overwriting the collected data on every iteration of the for loop.

What you want is to append rows to the data set, so you have to concatenate the existing data with the new data. That is, the dados object itself has to appear inside the rbind call. It is a recursive operation: the new value of the data.frame is the old one plus the updates.

But before that, you have to turn the vectors teses.titulos and links.teses into two columns, which I did below with cbind.
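To see the pattern in isolation, here is a minimal sketch of the accumulation idiom with invented toy vectors standing in for teses.titulos and links.teses (no web scraping involved):

```r
# Toy illustration of accumulating rows across loop iterations.
# The titulos/links vectors are made-up stand-ins for the scraped data.
dados <- data.frame()
for (i in 1:3) {
  titulos <- paste("title", i, 1:2)                 # pretend titles from page i
  links   <- paste0("http://example.org/", i, "/", 1:2)  # pretend links from page i
  # cbind pairs the two vectors into columns;
  # rbind appends them to the rows collected so far instead of overwriting
  dados <- rbind(dados, cbind(titulos, links))
}
nrow(dados)  # 6: two rows per page, all pages kept
```

Without dados inside the rbind call, each iteration would replace the previous result and only the last page would survive.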

# rm(list=ls()) # not a good idea to include this line on Stack Overflow...
# options(warn=-1) # disabling all warnings is not a good idea; I moved this option into readLines (below)
#library("RCurl") # disabled -- this package is not being used
library("XML")

baseurl <- "http://www.gmbahia.ufba.br/index.php/gmbahia/issue/archive?issuesPage=XX#issues"

dados <- data.frame()

for (i in 1:10){
  print(i)
  url <- gsub("XX", i, baseurl)
  url <- xmlRoot(htmlParse(readLines(url, warn = FALSE))) # suppress warnings here instead
  links <- getNodeSet(url, "//h4/a") # added h4 here -- to grab only the thesis links

  ## Thesis links
  links.teses <- xmlSApply(links, xmlGetAttr, name = "href")
  #links.teses <- grep("view", links.teses, value = T) # disabled - unnecessary line
  #links.teses # disabled - unnecessary line

  ## Edition names
  teses.titulos <- xmlSApply(links, xmlValue)
  #teses.titulos <- grep("de", teses.titulos, value = T) # disabled - unnecessary line
  #teses.titulos # disabled - unnecessary line
  dados <- rbind(dados, cbind(teses.titulos, links.teses)) # this is where the error was
}
  • I did exactly that, but I was waiting for a response from the author of the question. :-)
